01-05-2024, 06:04 AM | #1 |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
Advice on how to scrape or use an api for thousands of books?
Hello.
First of all, I use Calibre, but I am learning development, and I decided to write a CLI script that fetches metadata for all the books in a folder. I have most of it working, but I can't figure out how to run queries for several thousand books, or even 100,000. Google Books has a daily limit of 1,000 requests, and Open Library allows 100 every 5 minutes. I heard Kovid mention that he used DuckDuckGo. Mr. Goyal, if you read this by any chance, could you please tell me how you did it?
I wanted to point pyppeteer at DuckDuckGo or Google, but I can't figure out from their robots.txt what is and isn't permitted, and I don't want to get blacklisted by mistake. I also found out that Google considers queries above 100,000 chump change and will raise the limit for free if I am issued an API key, but that means putting in my credit card information, and I don't think the users, or frankly I myself, would be comfortable with that.
Thank you for reading. Please let me know if I need to elaborate further.
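To show what I mean, here is a stripped-down sketch of my current loop. The Open Library search endpoint is real, but the fields I print and the 3-second sleep are my own choices, not anything from Calibre:
Code:
import time

import requests

SEARCH_URL = "https://openlibrary.org/search.json"

def fetch_metadata(title, author):
    """Ask Open Library for one book and return the first matching record."""
    params = {"title": title, "author": author, "limit": 1}
    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    docs = resp.json().get("docs", [])
    return docs[0] if docs else None

books = [("Dune", "Frank Herbert"), ("Hyperion", "Dan Simmons")]
for title, author in books:
    doc = fetch_metadata(title, author)
    if doc:
        print(doc.get("title"), doc.get("first_publish_year"))
    time.sleep(3)  # one query per 3 s stays under 100 per 5 minutes
This works fine, but at one query every 3 seconds, 100,000 books would take over three days, which is the whole problem.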
01-05-2024, 06:25 AM | #2 |
creator of calibre
Posts: 44,006
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
There's no free service that will let you make that many queries. Indeed, nowadays even Google restricts you to about 50-ish queries a day.
01-05-2024, 09:13 AM | #3 | |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Quote:
I saw that they publicly state that unverified accounts get 1,000 per day. I guess the days of asking for even a measly 10,000 are over? I saw recent Stack Overflow threads saying it can still be done with a credit card, but I also saw a lot of posts about people being denied. I thought they asked for way too much.
If I decide to scrape the info using DuckDuckGo, Google, and Bing interchangeably, let's say 15,000 queries for each, is there a possibility of them blacklisting me, or blacklisting my MAC address and HWID? I honestly can't tell from their robots.txt what is and isn't permitted.
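For what it's worth, Python's standard library can at least answer the mechanical half of the robots.txt question, i.e. whether a given user agent may fetch a given path. The URLs here are just examples I picked:
Code:
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt.
rp = RobotFileParser()
rp.set_url("https://duckduckgo.com/robots.txt")
rp.read()

# "*" matches any crawler that robots.txt does not name explicitly.
for url in ("https://duckduckgo.com/html/", "https://duckduckgo.com/?q=test"):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")
Of course robots.txt only tells you what the file permits; the terms of service and rate limiting are a separate matter.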
01-05-2024, 09:30 AM | #4 |
creator of calibre
Posts: 44,006
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
calibre does not use Google APIs; it queries the same URLs as you do when you use a browser. And Google rate-limits these queries to 50-odd a day.
01-05-2024, 10:26 AM | #5 | |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Quote:
That is what I meant. Could you please describe the workflow in a sentence or two? What websites do you use, and would a headless browser with Selenium be enough? I wanted to use a browser too, but was worried that the site or the search engine might block me.
Someone praised my project and said it would help them with 50,000 books. I immediately realized my project couldn't handle that, and I have spent days trying to make it work, since it would be a nice feature and I am building my portfolio. I thought about the algorithm's complexity, but even if it were O(n^n), a million operations isn't much, and if I run into trouble I could port the Python code to Go and port it back once Python moves past the GIL.
The logic I found here:
https://github.com/kovidgoyal/calibr...mazon.py#L1094
https://github.com/kovidgoyal/calibr...ngines.py#L177
goes way over my head. I don't know if I should focus exclusively on cached pages and instant searches, or whether I could just search for {Title} AND {Author}, with {publisher}, {rating}, and {rating_count} included. A sketch of what I was picturing follows below.
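In case it helps anyone else, this is the kind of headless Selenium skeleton I had in mind. The search URL and the CSS selector are placeholders I made up, not anything taken from calibre's engines.py:
Code:
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

# Headless Chrome; any Selenium-supported browser would do.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    query = quote_plus('"The Fifth Season" "N. K. Jemisin"')
    driver.get(f"https://html.duckduckgo.com/html/?q={query}")
    # Placeholder selector -- the real one depends on the engine's markup.
    for link in driver.find_elements(By.CSS_SELECTOR, "a.result__a"):
        print(link.get_attribute("href"))
finally:
    driver.quit()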
01-05-2024, 10:33 AM | #6 |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Oh, and of course it would not be O(n^n); that was just an example. It should be in the range of O(n^2) or O(n^3) at worst, and with asyncio, aiosqlite, and SQLAlchemy it might even approach O(n).
I am just trying to build something that shows recruiters I have the basics down.
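Concretely, the asyncio + aiosqlite part would look roughly like this; the table layout, the semaphore limit, and the endpoint are stand-ins for whatever I end up using:
Code:
import asyncio
import json

import aiohttp
import aiosqlite

SEM = asyncio.Semaphore(5)  # cap concurrent requests at 5

async def fetch_one(session, db, title, author):
    """Fetch metadata for one book and cache the raw JSON in SQLite."""
    async with SEM:
        params = {"title": title, "author": author, "limit": 1}
        async with session.get("https://openlibrary.org/search.json",
                               params=params) as resp:
            data = await resp.json()
    await db.execute(
        "INSERT OR REPLACE INTO meta (title, author, raw) VALUES (?, ?, ?)",
        (title, author, json.dumps(data)),
    )
    await db.commit()

async def main(books):
    async with aiosqlite.connect("cache.db") as db:
        await db.execute(
            "CREATE TABLE IF NOT EXISTS meta "
            "(title TEXT, author TEXT, raw TEXT, PRIMARY KEY (title, author))"
        )
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(
                *(fetch_one(session, db, t, a) for t, a in books)
            )

asyncio.run(main([("Dune", "Frank Herbert"), ("Hyperion", "Dan Simmons")]))
The cache means a re-run only queries books it hasn't seen, which matters when the remote side only allows a few queries a day.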
01-05-2024, 10:37 PM | #7 |
creator of calibre
Posts: 44,006
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
A headless browser and Selenium are fine, but you will not be able to scrape large numbers of results.
01-08-2024, 07:33 PM | #8 |
Grand Sorcerer
Posts: 6,531
Karma: 26425959
Join Date: Apr 2009
Location: USA
Device: iPhone 15PM, Kindle Scribe, iPad mini 6, PocketBook InkPad Color 3
|
After running into the Google 'too many requests' error too many times, I switched to using the Goodreads plugin to fetch metadata. Since then, no issues, and the metadata (particularly Series) seems better overall. I rarely need to correct it.
It may well be that the Goodreads API has a similar limit, but by the time I switched, I no longer had hundreds of books to fetch metadata for.
01-12-2024, 05:52 AM | #9 | |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Quote:
Unfortunately, I missed my chance: Goodreads no longer issues new API keys.
Tags |
api, googlebooks api, openlibrary api, scraping |
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
How to feed a Kobo with thousands of books? | Mingyar | Kobo Reader | 31 | 03-15-2022 09:28 AM |
Sony's New German Ebookstore Features Thousands Of DRM-Free Books | kesey | News | 5 | 12-15-2012 01:13 PM |
When an e-reader is loaded with thousands of books, does it gain any weight? | Hoyt Clagwell | General Discussions | 29 | 11-10-2011 02:32 PM |
Random House to digitize thousands of books | DonaldL. | News | 34 | 12-04-2008 08:39 AM |
Random House to digitize thousands of books | zelda_pinwheel | News | 0 | 11-24-2008 09:58 AM |