Monday, February 23, 2009

Exploring a ‘Deep Web’ That Google Can’t Grasp

From: http://www.nytimes.com

Photo: Jeffrey D. Allred for The New York Times. At the University of Utah, Prof. Juliana Freire is working on DeepPeep, an ambitious effort to index every public database online.

One day last summer, Google’s search engine trundled quietly past a milestone. It added the one trillionth address to the list of Web pages it knows about. But as impossibly big as that number may seem, it represents only a fraction of the entire Web.

Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.

The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available — if only the search engines knew how to find them.

Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results — it may ultimately reshape the way many companies do business online.

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries.

“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.

“Most search engines try to help you find a needle in a haystack,” Mr. Rajaraman said, “but what we’re trying to do is help you explore the haystack.”

That haystack is infinitely large. With millions of databases connected to the Web, and endless possible permutations of search terms, there is simply no way for any search engine — no matter how powerful — to sift through every possible combination of data on the fly.

To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases. For example, if a user types in “Rembrandt,” the search engine needs to know which databases are most likely to contain information about art (say, museum catalogs or auction houses), and what kinds of queries those databases will accept.

That approach may sound straightforward in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.
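To make that brokering step a bit more concrete, here is a minimal sketch in Python; the database names and keyword profiles below are invented for illustration and are not taken from Kosmix, Google, or any real system. It simply ranks candidate databases by how many words of a query appear in each one’s topic profile:

```python
# Hypothetical topic profiles for a handful of Deep Web databases.
ART_DATABASES = {
    "museum_catalog": {"art", "painting", "rembrandt", "vermeer", "sculpture"},
    "auction_house": {"art", "auction", "lot", "provenance", "picasso"},
    "flight_schedules": {"flight", "fare", "airport", "departure"},
}

def broker_query(query, profiles):
    """Rank databases by how many query words appear in their topic profile."""
    words = set(query.lower().split())
    scores = {name: len(words & keywords) for name, keywords in profiles.items()}
    # Keep only databases that matched at least one word, best match first.
    return [name for name, score in sorted(scores.items(), key=lambda kv: -kv[1])
            if score > 0]

print(broker_query("Rembrandt", ART_DATABASES))  # -> ['museum_catalog']
```

Real systems obviously need far richer profiles than a handful of keywords, but the routing decision they make is essentially this one.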

“This is the most interesting data integration problem imaginable,” says Alon Halevy, a former computer science professor at the University of Washington who is now leading a team at Google that is trying to solve the Deep Web conundrum.

Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
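A toy version of that probing loop might look like the sketch below. The query_form() callable stands in for whatever code would actually submit a term to a site’s search form; everything here is a hypothetical illustration of the idea described above, not Google’s actual method:

```python
from collections import Counter

SEED_TERMS = ["Rembrandt", "Picasso", "Vermeer", "Monet", "Caravaggio"]

def probe_database(query_form, seed_terms=SEED_TERMS):
    """Probe a search form with guessed terms and summarize what comes back."""
    vocabulary = Counter()
    for term in seed_terms:
        results = query_form(term)   # submit one guess to the form
        if not results:
            continue                 # no match; try the next guess
        for record in results:
            vocabulary.update(record.lower().split())
    # The word frequencies act as a crude predictive model of the database's
    # contents, usable later for deciding which queries to route to it.
    return vocabulary

# Toy in-memory "database" standing in for a real search form:
TOY_DB = {"rembrandt": ["Rembrandt van Rijn, The Night Watch, 1642"]}
print(probe_database(lambda term: TOY_DB.get(term.lower(), [])).most_common(3))
```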

In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.

“The naïve way would be to query all the words in the dictionary,” Ms. Freire said. Instead, DeepPeep starts by posing a small number of sample queries, “so we can then use that to build up our understanding of the databases and choose which words to search.”

Based on that analysis, the program then fires off automated search terms in an effort to dislodge as much data as possible. Ms. Freire claims that her approach retrieves better than 90 percent of the content stored in any given database. Ms. Freire’s work has recently attracted overtures from one of the major search engine companies.
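One way to picture that guessing game is the simplified sketch below: a few sample queries go out, the words found in the returned records become candidate terms for the next round, and the loop repeats. The fetch() callable is a hypothetical placeholder for a real form submission; nothing here is DeepPeep’s actual code:

```python
def harvest(fetch, sample_queries, rounds=3, per_round=5):
    """Iteratively pull records out of a database, letting words from
    earlier results suggest the next round of probe terms."""
    seen_records, tried = set(), set()
    candidates = list(sample_queries)
    for _ in range(rounds):
        next_candidates = []
        for term in candidates[:per_round]:
            if term in tried:
                continue
            tried.add(term)
            for record in fetch(term):
                if record not in seen_records:
                    seen_records.add(record)
                    # Words from newly seen records drive the next round.
                    next_candidates.extend(record.lower().split())
        candidates = next_candidates
    return seen_records
```

The hard part in practice is choosing those candidate terms intelligently rather than exhaustively, which is exactly the computational guessing game the article describes.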

As the major search engines start to experiment with incorporating Deep Web content into their search results, they must figure out how to present different kinds of data without overcomplicating their pages. This poses a particular quandary for Google, which has long resisted the temptation to make significant changes to its tried-and-true search results format.

“Google faces a real challenge,” said Chris Sherman, executive editor of the Web site Search Engine Land. “They want to make the experience better, but they have to be supercautious with making changes for fear of alienating their users.”

Beyond the realm of consumer searches, Deep Web technologies may eventually let businesses use data in new ways. For example, a health site could cross-reference data from pharmaceutical companies with the latest findings from medical researchers, or a local news site could extend its coverage by letting users tap into public records stored in government databases.

This level of data integration could eventually point the way toward something like the Semantic Web, the much-promoted — but so far unrealized — vision of a Web of interconnected data. Deep Web technologies hold the promise of achieving similar benefits at a much lower cost, by automating the process of analyzing database structures and cross-referencing the results.

“The huge thing is the ability to connect disparate data sources,” said Mike Bergman, a computer scientist and consultant who is credited with coining the term Deep Web. Mr. Bergman said the long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers.

Library Value Calculator

The Denver Public Library has a Library Value Calculator on its website. It was mentioned in this story about a person who decided not to buy any books for one year. (The article indicates the author is a librarian.)

Excerpt: There are several reasons I stopped buying books in 2008. With a young child at home, a car payment and student loans, saving money was becoming more important to me than owning "Zazie in the Metro" or "Tamerlane: Sword of Islam." As a librarian I also saw a limit on book buying as an opportunity to enrich my professional life by experiencing the library more fully as a patron. Finally, part of me just wanted to see if I could do it.

Among the comments on the article is this one: Thank you Denver Post, for printing an article that damages further an already damaged industry, the local book shop. While libraries have their places, we desperately need the few surviving bookshops. How about an article boosting the buying of books?

The Bookcave



Friday, February 20, 2009

Koran and Bible Moved To the Top Shelves in U.K.


In this instance, Dewey's system is not part of the equation. It seems that officials at UK libraries have recommended keeping all holy books, including the Bible and the Koran, on the top shelves in the interests of equality.

Leicester's librarians consulted the Federation of Muslim Organisations and were advised that all religious texts should be kept on the top shelf to ensure equality.

But there are critics of the new requirements; Robert Whelan of the Civitas think-tank told The Daily Mail: "Libraries and museums are not places of worship. They should not be run in accordance with particular religious beliefs."

Christian.org UK argues that Christians do not apply such beliefs to the Bible, which they say should be easily accessible for everyone.

More from Telegraph UK and opinion (unorthodox to say the least) from Damian Thompson of the Telegraph.

Friday, February 13, 2009

Syriac Bible found in Cyprus

Authorities in northern Cyprus believe they have found an ancient version of the Bible written in Syriac, a dialect of the native language of Jesus.

The manuscript was found in a police raid on suspected antiquity smugglers. Turkish Cypriot police testified in a court hearing they believe the manuscript could be about 2,000 years old.

The manuscript carries excerpts of the Bible written in gold lettering on vellum and loosely strung together, photos provided to Reuters showed. One page carries a drawing of a tree, and another eight lines of Syriac script.

Experts were however divided over the provenance of the manuscript, and whether it was an original, which would render it priceless, or a fake.

Experts said the use of gold lettering on the manuscript was likely to date it later than 2,000 years.

"I'd suspect that it is most likely to be less than 1,000 years old," leading expert Peter Williams, Warden of Tyndale House, University of Cambridge told Reuters.

Turkish Cypriot authorities seized the relic last week and nine individuals are in custody pending further investigations. More individuals are being sought in connection with the find, they said.

Further investigations turned up a prayer statue and a stone carving of Jesus believed to be from a church in the Turkish held north, as well as dynamite.

The police have charged the detainees with smuggling antiquities, illegal excavations and the possession of explosives.

Syriac is a dialect of Aramaic - the native language of Jesus - once spoken across much of the Middle East and Central Asia. It is used wherever there are Syrian Christians and still survives in the Syrian Orthodox Church in India.

Aramaic is still used in religious rituals of Maronite Christians in Cyprus.

"One very likely source (of the manuscript) could be the Tur-Abdin area of Turkey, where there is still a Syriac speaking community," Charlotte Roueche, Professor of Late Antique and Byzantine Studies at King's College London told Reuters.

Stories regarding the antiquity of manuscripts are commonplace. One example is the Yonan Codex, carbon-dated to the 12th century, which people tried to pass off as earlier.

After further scrutiny of photographs of the book, J.F. Coakley, a manuscripts specialist at the University of Cambridge library and a Fellow of Wolfson College, suggested that the book could have been written a good deal later.

"The Syriac writing seems to be in the East Syriac script with vowel points, and you do not find such manuscripts before about the 15th century.

"On the basis of the one photo...if I'm not mistaken some words at least seem to be in modern Syriac, a language that was not written down until the mid-19th century," he told Reuters.

Win a trip to the ALA Annual Conference in Chicago


Monday, February 09, 2009

Libraries can keep books with lead-containing ink

By Liz Szabo, USA TODAY

Librarians won't have to throw away their children's books after all when a sweeping new product safety law takes effect on Tuesday.

The law, passed in August, dramatically cuts the amount of lead and other chemicals allowed in kids' products. That had librarians worried, because some books made before the 1980s had ink that contained lead.


Today, the Consumer Product Safety Commission, the federal agency in charge of enforcing the law, announced that it won't prosecute anyone for distributing "ordinary" children's books printed after 1985. These books have never been found to violate the new lead standards, which will mandate that kids' items contain no more than 600 parts per million of lead beginning Tuesday. The standards get twice as tough in August, when the limit drops to 300 parts per million.

Given the way that kids tear and chew through library books, congressional staff involved in the legislation say it's unlikely that libraries have many children's books that are more than 24 years old.

Thursday, February 05, 2009

Book burning on Feb. 10th 2009 due to CPSIA

Book in Flames

The Consumer Product Safety Improvement Act (CPSIA H.R. 4040) has a good goal: protect kids from dangerous imports tainted with lead. Bravo! Unfortunately it goes about doing so in such a way that it’ll drive up costs across the board, drive many manufacturers and retailers out of business, and not really make kids any safer.

So what does CPSIA do? It mandates lead testing for ALL items intended for children under 13 or PERCEIVED as being for those under age 13. So items commonly regarded as “kids’ stuff” even when they are intended for adults, such as many comics, collectible books, and high-end pop-ups, still fall under the statute even though they’re aimed at adult collectors.

It requires UNIT testing. The final product must be tested from each batch. It doesn’t matter if all the components going into it are certified and have been tested as containing no lead; the finished product still must be tested for lead.

Here’s an example. You publish textbooks for 4th graders. You publish a science textbook. You publish a spelling book. They are printed with all the same materials, on the same day, on the same press, with the same crew manning it. You must test the science book and the spelling book separately because they may contain lead!

This basically seems to imply that somehow alchemy works: non-lead-containing item + non-lead-containing item = LEAD!

The manufacturer needs to provide a testing certificate to the retailer, which must be available for inspection should a Consumer Product Safety Commission inspector come in. No certificate, and the retailer can’t sell it.

The truly bizarre part is that the new regulations apply retroactively. Even if it was printed 50 years ago and the publisher no longer exists, you need to have a certificate proving it’s not filled with lead. Even if it is the only remaining copy of a rare children’s book worth thousands of dollars and only will ever be handled by collectors, you cannot sell it because you can’t prove it is not filled with lead.

Anything manufactured after November 10th 2008 should have come with a certificate certifying it has been tested for lead. If your distributor didn’t provide one, you need to call and get one. As of February 10th, it’s in fact illegal for your distributor to sell you a kids’ book without a certificate of lead testing, no matter when it was printed.

Objects without a certification still have to be tested. So those copies of Harry Potter and the Deathly Hallows printed in 2007 that are still available new at Amazon may have to be destroyed as of February 10th, 2009, because they haven’t been tested for lead. (Amazon is taking this seriously and sent an email to all affiliates asking them to provide the lead testing certificates for all items.)

How bad can the punishment be? For selling books? Up to $100,000 PER ITEM and up to five years in jail. It’s also a felony. Get busted and you may lose your right to vote in some states. Even if you can fight it in court, you’ll likely go broke doing so, and your local newspaper will carry the headline “Local business selling lead-tainted goods”… even though you know they aren’t. Good luck getting them to print the retraction months or years later after that PR disaster.

This includes not just selling, but distribution. So you can’t donate the untested goods to your local library, Goodwill, or literacy program. You can’t sell them to overseas collectors either, as they’re illegal to export. (Preventing the dumping of truly toxic goods on third-world markets is one of the few good portions of this law. Good job on that, bad job on the rest.)

This leaves you, the bookseller, with two legal options: store the books indefinitely, hoping regulations change, OR destroy them.

What to do? Write your Congressman. You can look up the mailing info for your Congressman and Senators through House.gov and Senate.gov. Call them on the phone too! Some of them may have a staffer dedicated to handling inquiries or willing to tell you which of the many addresses will get the mail into your representative’s hands fastest.


Foreign dealers: this does affect ALL imports, even individual items shipped through the mail. Try writing to your country’s consulate in the US. They cannot directly affect legislation, but they can certainly express their concern to government officials in the US.

EDIT: As of 1/8/09, the CPSC has issued an exemption for secondhand dealers. New books are STILL not exempted, but it’s a step in the right direction. (And there’s no guarantee they won’t change their mind again.)

Press release on exemption here

Monday, February 02, 2009

She So Loved the Library She Left It Her Inheritance

The Baltimore Sun reports that Enoch Pratt Free Library officials happily discovered the esteem one of their retirees held for the place.

At her death, Sara (Bunny) Siebert directed that more than $650,000 of her assets go to the library, a figure that exceeds the total of all the paychecks she took home in her 34 years as Pratt's director of young adult reading. She died at age 88 last year.

Siebert, an energetic and popular librarian who sought no attention as a donor during her life, left an estate of more than $2 million.

Having no survivors, she divided her assets among the Baltimore institutions she admired, including the Pratt Library and her alma mater, Goucher College.

Google's Got All the Marbles, posted by birdie

Robert Darnton, head of the Harvard library system, writes in a lengthy article in the February 12th issue of the New York Review of Books:

"Google will enjoy what can only be called a monopoly--a monopoly of a new kind, not of railroads or steel but of access to information. Google has no serious competitors. Google alone has the wealth to digitize on a massive scale. And having settled with the authors and publishers, it can exploit its financial power from within a protective legal barrier; for the class action suit covers the entire class of authors and publishers."

He also discusses the economics of professional journals and how the system has changed over the past hundred years. A portion of his commentary:

"The result stands out on the acquisitions budget of every research library: the Journal of Comparative Neurology now costs $25,910 for a year's subscription; Tetrahedron costs $17,969 (or $39,739, if bundled with related publications as a Tetrahedron package); the average price of a chemistry journal is $3,490; and the ripple effects have damaged intellectual life throughout the world of learning. Owing to the skyrocketing cost of serials, libraries that used to spend 50 percent of their acquisitions budget on monographs now spend 25 percent or less. University presses, which depend on sales to libraries, cannot cover their costs by publishing monographs. And young scholars who depend on publishing to advance their careers are now in danger of perishing."

Tons of Twittery Tips

From http://lisnews.org


From John Kremer, Book Market, here are just some of his Twitter suggestions:

Google Twitter Gadget: Allows you to read and update Twitter right on your desktop.

Loud Twitter: This batch-tweeting service allows users to set up automatic posting of their tweets to their blog (a listing of tweets once per day).

Mr. Tweet: Helps you build meaningful relationships on Twitter, showing you the followers and influencers you should follow. Also recommends you to enthusiastic users relevant to you.

A free service that allows you to customize your Twitter page. Again, a customized page is a boon for helping you to brand yourself on Twitter. I (www.bookmarket.com) used this service to produce my current Twitter background in five minutes.

Ping Vine: A free service that takes an Atom or RSS feed from your blog, lifestream or favorite website and posts it to Twitter, Ping.fm or Identi.ca. Hence, you can automatically post your blog posts to Twitter via RSS feed.