Baidu Wenku Downloader
A tool for downloading documents from Baidu Wenku (百度文库), built by reverse engineering Baidu Wenku's Canvas rendering mechanism.
Official Website
Official website: Gitea Repository
GitHub mirror: GitHub Repository
For complete source code access, please contact the author at: lostjaychi@gmail.com
Features
- Download PDF documents from Baidu Wenku
- Support for various document formats
- Asynchronous processing for better performance
Demo Video
Watch the demo video on Bilibili: Baidu Wenku Downloader Demo
How to Find Document ID
The document ID is a unique identifier in the Baidu Wenku URL. For example, in the URL:
https://wenku.baidu.com/view/1898f455874769eae009581b6bd97f192279bff4.html
The document ID is: 1898f455874769eae009581b6bd97f192279bff4
You can find this ID in the URL of any Baidu Wenku document page. It's the string of characters between /view/ and .html in the URL.
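The rule above can be sketched as a small helper. This is only an illustration: the function name extract_doc_id is hypothetical and not part of this project, and the pattern assumes IDs are plain alphanumeric strings like the ones shown in this README.

```python
import re
from urllib.parse import urlparse

# Assumption: document IDs consist only of letters and digits, as in the
# examples in this README. The ID length is not fixed, so none is enforced.
_DOC_ID_RE = re.compile(r"^/view/([0-9A-Za-z]+)\.html$")


def extract_doc_id(url: str) -> str:
    """Return the part of the URL path between /view/ and .html."""
    # urlparse drops the query string and fragment, leaving only the path.
    path = urlparse(url).path
    match = _DOC_ID_RE.match(path)
    if match is None:
        raise ValueError(f"no document ID found in URL: {url!r}")
    return match.group(1)


if __name__ == "__main__":
    example = "https://wenku.baidu.com/view/1898f455874769eae009581b6bd97f192279bff4.html"
    print(extract_doc_id(example))
```

URLs that do not follow the /view/<id>.html shape raise ValueError rather than returning a guess.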
Test API
A test API endpoint is available for testing PDF downloads:
https://lostjay.xyz/api/get_pdf?doc_id=6bb03e2669dc5022aaea00e5
Important Note: Only one PDF can be processed at a time. If the server is busy, you may receive a "Server is busy, please try again later" response.
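A client for the test endpoint might look like the sketch below. The README does not document the response format or status codes, so this example makes an assumption: any response body that starts with the PDF signature %PDF- is treated as a finished document, and anything else (such as the busy message) triggers a wait and a retry. The function names, retry count, and wait time are all hypothetical.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://lostjay.xyz/api/get_pdf"  # test endpoint from this README


def build_request_url(doc_id: str) -> str:
    """Build the request URL with the doc_id query parameter."""
    return f"{API_URL}?{urllib.parse.urlencode({'doc_id': doc_id})}"


def looks_like_pdf(body: bytes) -> bool:
    """PDF files begin with the signature %PDF-."""
    return body.startswith(b"%PDF-")


def fetch_pdf(doc_id: str, attempts: int = 3, wait_seconds: float = 10.0) -> bytes:
    """Request the PDF, retrying when the body is not a PDF (assumed busy)."""
    url = build_request_url(doc_id)
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=120) as response:
                body = response.read()
        except urllib.error.HTTPError as error:
            # Assumption: the busy message may arrive with an error status.
            body = error.read()
        if looks_like_pdf(body):
            return body
        if attempt < attempts:
            time.sleep(wait_seconds)
    raise RuntimeError(f"no PDF received for {doc_id!r} after {attempts} attempts")
```

Calling fetch_pdf("6bb03e2669dc5022aaea00e5") and writing the returned bytes to a file opened in binary mode would save the document. Because the server handles one PDF at a time, a generous wait between attempts is likely kinder than rapid retries.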
Prerequisites
- Python 3.11 or higher
- Node.js and npm
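Before installing, the prerequisites can be checked with a short script. This is a hedged sketch rather than something shipped with the project: the helper names are hypothetical, and it assumes the Node.js and npm executables are called node and npm on your PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in this README


def python_is_supported(version: tuple = tuple(sys.version_info[:2])) -> bool:
    """Return True if the given (major, minor) version meets the minimum."""
    return version >= MIN_PYTHON


def missing_tools(tools: tuple = ("node", "npm")) -> list:
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.11 or higher is required")
    for tool in missing_tools():
        print(f"{tool} was not found on PATH")
```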
Installation
- Install Python dependencies:
pip install .
- Install Node.js dependencies:
npm install
- Start the Node.js service:
npm start
Project Structure
- baiduwenkudownloader/: Main Python package
- CrawlerUtils/: Utility functions for web crawling
- test/: Test files
- test/test_downloader.py: Contains main test functions
Testing
The main test function is located in test/test_downloader.py. To run the PDF download test:
cd test
python -m unittest test_downloader -k test_get_pdf
Dependencies
Python Dependencies
- bs4
- lxml
- curl-cffi
- tenacity
Node.js Dependencies
- canvas
- express
- jspdf
Disclaimer
This project is for educational and research purposes only. Users are responsible for complying with all applicable laws and regulations. The authors do not endorse or encourage any unauthorized use of this software. Please respect intellectual property rights and use this tool responsibly.
License
ISC License