scrapinghub/autoextract-spiders issues and pull requests

#29 - Update pyyaml requirement from <=3.13,>=3.10 to >=3.10,<=6.0.1 in the pip group across 1 directory

Pull Request - State: open - Opened by dependabot[bot] over 1 year ago
Labels: dependencies

#28 - Include -s FRONTERA_DISABLED=True in README examples to ensure local …

Pull Request - State: closed - Opened by ivanprado over 4 years ago

#27 - Set/reset Frontera for all page types

Pull Request - State: closed - Opened by vshlapakov about 5 years ago
Labels: bug

#26 - Breadth-first order for crawling

Pull Request - State: closed - Opened by ivanprado about 5 years ago

#25 - Deduplication to avoid infinite loops. Better priority queue for multiple domains.

Pull Request - State: closed - Opened by ivanprado about 5 years ago - 15 comments

#24 - Fix discovery because body was never empty

Pull Request - State: closed - Opened by ivanprado about 5 years ago - 1 comment

#23 - Mention seeds-file-url in the docstring

Pull Request - State: closed - Opened by vshlapakov about 5 years ago
Labels: documentation

#22 - Make the seeds-file-url param optional

Pull Request - State: closed - Opened by vshlapakov about 5 years ago
Labels: bug

#21 - Provide support for a seeds file url

Pull Request - State: closed - Opened by vshlapakov about 5 years ago - 7 comments
Labels: enhancement

#20 - Fixes and improvements

Pull Request - State: closed - Opened by croqaz over 5 years ago

#19 - Provide Docker image based on SC stack

Pull Request - State: closed - Opened by vshlapakov over 5 years ago
Labels: enhancement

#18 - Checking for invalid feeds may be too strict

Issue - State: closed - Opened by kmike over 5 years ago - 1 comment

#17 - Why is html5 stripping called for response.url?

Issue - State: closed - Opened by kmike over 5 years ago

#16 - Does rss.xml link discovery work?

Issue - State: closed - Opened by kmike over 5 years ago - 1 comment

#15 - Support for job posting

Pull Request - State: closed - Opened by croqaz over 5 years ago - 1 comment

#14 - Updated dependencies

Pull Request - State: closed - Opened by croqaz over 5 years ago

#13 - Added spider User-Agent header

Pull Request - State: closed - Opened by croqaz over 5 years ago - 1 comment

#12 - Use newer scrapy:1.8-py3 stack

Pull Request - State: closed - Opened by vshlapakov over 5 years ago - 1 comment
Labels: enhancement

#11 - Describe using Crawlera with AE

Pull Request - State: closed - Opened by vshlapakov over 5 years ago - 1 comment

#10 - Add optional fake-useragent support

Pull Request - State: closed - Opened by vshlapakov over 5 years ago - 1 comment

#9 - Add optional Crawlera support

Pull Request - State: closed - Opened by vshlapakov over 5 years ago - 2 comments

#8 - Implemented date filter rules, specified as spider arg

Pull Request - State: open - Opened by croqaz over 5 years ago

#7 - Better de-duplication of URLs

Issue - State: open - Opened by croqaz over 5 years ago

#6 - Filter extracted articles by date

Issue - State: open - Opened by croqaz over 5 years ago
Labels: enhancement

#5 - It adds Fake UserAgent support

Pull Request - State: open - Opened by rafaelcapucho almost 6 years ago - 3 comments

#4 - It adds optional Crawlera support

Pull Request - State: closed - Opened by rafaelcapucho almost 6 years ago - 1 comment

#3 - Add optional AWS S3 export feature

Pull Request - State: closed - Opened by vshlapakov almost 6 years ago - 3 comments

#2 - Update Scrapy stack version

Pull Request - State: closed - Opened by vshlapakov almost 6 years ago - 2 comments

#1 - Add Frontera integration via HCF

Pull Request - State: closed - Opened by vshlapakov almost 6 years ago - 4 comments
