01-05-2024, 06:04 AM | #1 |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
Advice on how to scrape or use an api for thousands of books?
Hello.
First of all, I use Calibre, but I am learning development, and I decided to write a CLI script that fetches metadata for all the books in a folder. I have most of it working, but I can't figure out how to run queries for several thousand books, or even 100,000. Google Books has a daily limit of 1,000 requests, and Open Library allows 100 every 5 minutes. I heard Kovid mention that he used DuckDuckGo. Mr. Goyal, if you read this by any chance, could you please tell me how you did it?
I wanted to point pyppeteer at DuckDuckGo or Google, but I can't figure out from their robots.txt what is and isn't permitted, and I don't want to get blacklisted by mistake. I also found out that Google considers queries above 100,000 chump change and will raise the limit for free if I am issued an API key, but that means putting in my credit card information, and I don't think the users, or frankly I myself, would be comfortable with that.
Thank you for reading. Please let me know if I need to elaborate further.
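To show what I mean, here is a stripped-down sketch of my current loop. The Open Library search endpoint is real, but the fields I print and the 3-second sleep are my own choices, not anything from Calibre:
Code:
import time

import requests

SEARCH_URL = "https://openlibrary.org/search.json"

def fetch_metadata(title, author):
    """Ask Open Library for one book and return the first matching record."""
    params = {"title": title, "author": author, "limit": 1}
    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    docs = resp.json().get("docs", [])
    return docs[0] if docs else None

books = [("Dune", "Frank Herbert"), ("Hyperion", "Dan Simmons")]
for title, author in books:
    doc = fetch_metadata(title, author)
    if doc:
        print(doc.get("title"), doc.get("first_publish_year"))
    time.sleep(3)  # one query per 3 s stays under 100 per 5 minutes
This works fine, but at one query every 3 seconds, 100,000 books would take over three days, which is the whole problem.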
01-05-2024, 06:25 AM | #2 |
creator of calibre
Posts: 44,006
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
There's no free service that will let you make that many queries. Indeed, nowadays even Google restricts you to about 50-ish queries a day.
01-05-2024, 09:13 AM | #3 | |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Quote:
I saw that they publicly state that unverified accounts get 1,000 per day. I guess the days of asking for even a measly 10,000 are over? I saw recent Stack Overflow threads saying it can still be done with a credit card, but I also saw a lot of posts about people being denied. I thought they asked for way too much.
If I decide to scrape the info using DuckDuckGo, Google, and Bing interchangeably, let's say 15,000 queries for each, is there a possibility of them blacklisting me, or blacklisting my MAC address and HWID? I honestly can't tell from their robots.txt what is and isn't permitted.
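For what it's worth, Python's standard library can at least answer the mechanical half of the robots.txt question, i.e. whether a given user agent may fetch a given path. The URLs here are just examples I picked:
Code:
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt.
rp = RobotFileParser()
rp.set_url("https://duckduckgo.com/robots.txt")
rp.read()

# "*" matches any crawler that robots.txt does not name explicitly.
for url in ("https://duckduckgo.com/html/", "https://duckduckgo.com/?q=test"):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")
Of course robots.txt only tells you what the file permits; the terms of service and rate limiting are a separate matter.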
01-05-2024, 09:30 AM | #4 |
creator of calibre
Posts: 44,006
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
calibre does not use Google APIs; it queries the same URLs as you do when you use a browser. And Google rate-limits these queries to 50-odd a day.
01-05-2024, 10:26 AM | #5 | |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Quote:
That is what I meant. Could you please describe the workflow in a sentence or two? What websites do you use, and would a headless browser with Selenium be enough? I wanted to use a browser too, but was worried that the site or the search engine might block me.
Someone praised my project and said it would help them with 50,000 books. I immediately realized my project couldn't handle that, and I have spent days trying to make it work, since it would be a nice feature and I am building my portfolio. I thought about the algorithm's complexity, but even if it were O(n^n), a million operations isn't much, and if I run into trouble I could port the Python code to Go and port it back once Python moves past the GIL.
The logic I found here:
https://github.com/kovidgoyal/calibr...mazon.py#L1094
https://github.com/kovidgoyal/calibr...ngines.py#L177
goes way over my head. I don't know if I should focus exclusively on cached pages and instant searches, or whether I could just search for {Title} AND {Author}, with {publisher}, {rating}, and {rating_count} included. A sketch of what I was picturing follows below.
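In case it helps anyone else, this is the kind of headless Selenium skeleton I had in mind. The search URL and the CSS selector are placeholders I made up, not anything taken from calibre's engines.py:
Code:
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

# Headless Chrome; any Selenium-supported browser would do.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    query = quote_plus('"The Fifth Season" "N. K. Jemisin"')
    driver.get(f"https://html.duckduckgo.com/html/?q={query}")
    # Placeholder selector -- the real one depends on the engine's markup.
    for link in driver.find_elements(By.CSS_SELECTOR, "a.result__a"):
        print(link.get_attribute("href"))
finally:
    driver.quit()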
01-05-2024, 10:33 AM | #6 |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Oh, and of course it would not be O(n^n); that was just an example. It should be in the range of O(n^2) or O(n^3) at worst, and with asyncio, aiosqlite, and SQLAlchemy it might even approach O(n).
I am just trying to build something that shows recruiters I have the basics down.
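Concretely, the asyncio + aiosqlite part would look roughly like this; the table layout, the semaphore limit, and the endpoint are stand-ins for whatever I end up using:
Code:
import asyncio
import json

import aiohttp
import aiosqlite

SEM = asyncio.Semaphore(5)  # cap concurrent requests at 5

async def fetch_one(session, db, title, author):
    """Fetch metadata for one book and cache the raw JSON in SQLite."""
    async with SEM:
        params = {"title": title, "author": author, "limit": 1}
        async with session.get("https://openlibrary.org/search.json",
                               params=params) as resp:
            data = await resp.json()
    await db.execute(
        "INSERT OR REPLACE INTO meta (title, author, raw) VALUES (?, ?, ?)",
        (title, author, json.dumps(data)),
    )
    await db.commit()

async def main(books):
    async with aiosqlite.connect("cache.db") as db:
        await db.execute(
            "CREATE TABLE IF NOT EXISTS meta "
            "(title TEXT, author TEXT, raw TEXT, PRIMARY KEY (title, author))"
        )
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(
                *(fetch_one(session, db, t, a) for t, a in books)
            )

asyncio.run(main([("Dune", "Frank Herbert"), ("Hyperion", "Dan Simmons")]))
The cache means a re-run only queries books it hasn't seen, which matters when the remote side only allows a few queries a day.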
01-05-2024, 10:37 PM | #7 |
creator of calibre
Posts: 44,006
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
A headless browser and Selenium are fine, but you will not be able to scrape large numbers of results.
01-08-2024, 07:33 PM | #8 |
Grand Sorcerer
Posts: 6,531
Karma: 26425959
Join Date: Apr 2009
Location: USA
Device: iPhone 15PM, Kindle Scribe, iPad mini 6, PocketBook InkPad Color 3
|
After running into the Google 'too many requests' error too many times, I switched to using the Goodreads plugin to fetch metadata. Since then, no issues, and the metadata (particularly Series) seems better overall. I rarely need to correct it.
It may well be that the Goodreads API has a similar limit, but by the time I switched, I no longer had hundreds of books to fetch metadata for.
01-12-2024, 05:52 AM | #9 | |
Enthusiast
Posts: 27
Karma: 10000
Join Date: Jan 2019
Device: Kindle PW4
|
Quote:
Unfortunately, I missed my chance: Goodreads no longer issues new API keys.
Tags |
api, googlebooks api, openlibrary api, scraping |
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
How to feed a Kobo with thousands of books? | Mingyar | Kobo Reader | 31 | 03-15-2022 09:28 AM |
Sony's New German Ebookstore Features Thousands Of DRM-Free Books | kesey | News | 5 | 12-15-2012 01:13 PM |
When an e-reader is loaded with thousands of books, does it gain any weight? | Hoyt Clagwell | General Discussions | 29 | 11-10-2011 02:32 PM |
Random House to digitize thousands of books | DonaldL. | News | 34 | 12-04-2008 08:39 AM |
Random House to digitize thousands of books | zelda_pinwheel | News | 0 | 11-24-2008 09:58 AM |