www.ShoppingPodder.com

Leading Computer Shopping,
News and information


Part of the Identityscape.com network...

getxfactor.com jmoodmusic.com smartbusinesschoices.com mintdepot.com lowfaresalways.com evangelicalview.com shoppingpodder.com soproudlywehail.com webnews.ws currenthumor.com

 

 

Judging whether a URL exists among millions, insert if not
   Shopping Podder - the Best of Computer Postings! Forum Index -> Computer Artificial Intelligence - Genetic  
View previous topic :: View next topic  
Author Message
Fred
Guest






PostPosted: Thu Aug 21, 2008 2:24 am    Post subject: Judging whether a URL exists among millions, insert if not Reply with quote

***I've posted this message on another group but got no satisfactory
answer. Sorry if you receive duplicate mails***
Hi, all:
I've got such a problem: there are millions of URLs in the
database, and more URLs are being obtained. In order to update the
database with new URLs, firstly, I have to judge whether a certain URL
exists, if not then insert it into the database.
Directly using [ if not exists...] sentences poses large
pressure on database, for it's impossible to create an index on 'URL'
column. The reason is some of the URLs have a length more than 1024
bytes, which exceeds the limit of SQL 2005.
My solution is system.collections.hashtable. It's quite simple
to implement the logic, however, the memory consumption is rather
large. I've got about 1.5 million URLs, loading all of them into the
hashtable costs more than 1GB memory. Of course, 1GB is acceptable
now, but if the amount of URL doubles or even triples, the memory
consumption could be a big problem.
I've also 'devised' some other solutions:
1. Get the MD5 code of each URL, then insert them into a RB-
tree or B-tree.
2. At first pick the string between the begining 'http://' and
fist '/' after 'http://', then save it in a hashtable, after that,
save the reminding string in a sorted list or B-tree/RB-tree.
Considering the balance between time and memory consumption,
which do you think would be the best method, or there're better
way(there must be) to deal with it?
Back to top
Kent Paul Dolan
Guest






PostPosted: Thu Aug 21, 2008 6:39 pm    Post subject: Re: Judging whether a URL exists among millions, insert if n Reply with quote

Fred wrote:

Quote:
I've got such a problem: there are millions of
URLs in the database, and more URLs are being
obtained. In order to update the database with new
URLs, firstly, I have to judge whether a certain
URL exists, if not then insert it into the
database.

The biggest issue in making your choice is how much
performance you need, and for how long your design
has to continue functioning. Remember, computer
memories will grow over time, perhaps faster than
your URL database's need for space.

If the arrival rate of URLs is slow enough to allow
accessing the tables from disk instead of memory,
you can eschew building the lookup tables in memory
and instead build the tables on disk where you have
lots more space available, letting you do that
to build a quick and dirty lookup mechanism out of
readily available open source parts instead of
attempting any more complicated scheme.

If your need for performance is great, building a
quick and dirty lookup table won't suffice. Instead,
you can probably find an SQL compliant database
system that supports BLOBS (binary large objects)
and use those rather than limited length records to
store the spelled out URLs. BLOBS have been in use
since the previous millennium, it would suprise me
more than a bit if a 2005-standard SQL didn't
support them.

As to which hash you use, sure, a canned one like
MD5 would be a good choice.

xanthian.

Sorry that answer is fairly fuzzy and maybe not too
helpful.
Back to top
Fred
Guest






PostPosted: Fri Aug 22, 2008 12:48 am    Post subject: Re: Judging whether a URL exists among millions, insert if n Reply with quote

On Aug 21, 9:39 pm, Kent Paul Dolan <xanth...@well.com> wrote:
Quote:
Fred wrote:

 > I've got such a problem: there are millions of
 > URLs in the database, and more URLs are being
 > obtained. In order to update the database with new
 > URLs, firstly, I have to judge whether a certain
 > URL exists, if not then insert it into the
 > database.

The biggest issue in making your choice is how much
performance you need, and for how long your design
has to continue functioning. Remember, computer
memories will grow over time, perhaps faster than
your URL database's need for space.

If the arrival rate of URLs is slow enough to allow
accessing the tables from disk instead of memory,
you can eschew building the lookup tables in memory
and instead build the tables on disk where you have
lots more space available, letting you do that
to build a quick and dirty lookup mechanism out of
readily available open source parts instead of
attempting any more complicated scheme.

If your need for performance is great, building a
quick and dirty lookup table won't suffice. Instead,
you can probably find an SQL compliant database
system that supports BLOBS (binary large objects)
and use those rather than limited length records to
store the spelled out URLs. BLOBS have been in use
since the previous millennium, it would suprise me
more than a bit if a 2005-standard SQL didn't
support them.

As to which hash you use, sure, a canned one like
MD5 would be a good choice.

xanthian.

Sorry that answer is fairly fuzzy and maybe not too
helpful.

Thanks Kent. I'll look up BLOLBS in google.
Back to top
SucMucPaProlij
Guest






PostPosted: Mon Aug 25, 2008 11:11 am    Post subject: Re: Judging whether a URL exists among millions, insert if n Reply with quote

Quote:
My solution is system.collections.hashtable. It's quite simple
to implement the logic, however, the memory consumption is rather
large. I've got about 1.5 million URLs, loading all of them into the
hashtable costs more than 1GB memory. Of course, 1GB is acceptable
now, but if the amount of URL doubles or even triples, the memory
consumption could be a big problem.
I've also 'devised' some other solutions:
1. Get the MD5 code of each URL, then insert them into a RB-
tree or B-tree.
2. At first pick the string between the begining 'http://' and
fist '/' after 'http://', then save it in a hashtable, after that,
save the reminding string in a sorted list or B-tree/RB-tree.
Considering the balance between time and memory consumption,
which do you think would be the best method, or there're better
way(there must be) to deal with it?

There are many methods you can use to create optimal data structure.

For example:
http://en.wikipedia.org/wiki/Trie

You can also optimize your hash table - if you hold in the memory only hash
value and string ID while storing URLs in a table that has two columns (ID and
URL) you will need approx. 15-20 MB of memory to store 1.5 million URLs but
searching will be slower.
Back to top
Display posts from previous:   
   Shopping Podder - the Best of Computer Postings! Forum Index -> Computer Artificial Intelligence - Genetic  
Page 1 of 1
All times are GMT

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum