fhamborg/news-please issues and pull requests

#288 - Adds support for exporting to OpenSearch

Pull Request - State: open - Opened by wild5r 2 months ago - 2 comments

#287 - Update documentation

Pull Request - State: closed - Opened by Medno 2 months ago - 1 comment

#286 - Correctly extract all Sitemaps urls from robots.txt

Pull Request - State: closed - Opened by yldoctrine 4 months ago - 1 comment

#285 - Allow the RssCrawler to search for an RSS feed from the provided url

Pull Request - State: open - Opened by yldoctrine 4 months ago - 8 comments

#284 - Make fetching images optional for from_url and from_urls methods

Pull Request - State: closed - Opened by TMCarrier 4 months ago - 3 comments

#283 - Add missing dependencies to requirements.txt

Pull Request - State: closed - Opened by jkawamoto 4 months ago - 1 comment

#282 - Pass optional arguments to requests.get

Pull Request - State: closed - Opened by jkawamoto 4 months ago - 6 comments

#281 - Add check_certificate option in configuration to be able to crawl sites not having a valid certificate

Pull Request - State: closed - Opened by yldoctrine 4 months ago - 2 comments

#280 - Fix article_extractor to only instantiate the Extractors in the specified file

Pull Request - State: closed - Opened by yldoctrine 4 months ago - 1 comment

#279 - Encode empty "html_title" if it's not found

Pull Request - State: closed - Opened by Medno 5 months ago - 2 comments

#278 - Bugfixes

Pull Request - State: closed - Opened by t1h0 5 months ago

#277 - Remove typing_extensions.cast

Pull Request - State: closed - Opened by t1h0 5 months ago - 1 comment

#276 - Raise typing-extensions and python version

Pull Request - State: closed - Opened by t1h0 5 months ago - 1 comment

#275 - typing_extensions missing in requirements

Issue - State: closed - Opened by t1h0 5 months ago - 1 comment

#274 - Add robustness for pages missing title

Pull Request - State: closed - Opened by yldoctrine 5 months ago

#273 - Remove newspaper3k from setup

Pull Request - State: closed - Opened by Medno 5 months ago - 1 comment

#272 - Fix type annotations for earlier Python versions

Pull Request - State: closed - Opened by Medno 5 months ago - 2 comments

#271 - Add an additional sitemap check in Sitemap crawlers

Pull Request - State: closed - Opened by Medno 5 months ago - 3 comments

#270 - Add 4 date metadata patterns in date extraction

Pull Request - State: closed - Opened by anteverse 5 months ago - 2 comments

#269 - Add Redis pipeline

Pull Request - State: closed - Opened by anteverse 6 months ago - 5 comments

#268 - Add unique constraint on column `url` on table `CurrentVersions` in Postgres pipeline

Pull Request - State: closed - Opened by anteverse 6 months ago - 3 comments

#267 - Cast get_max_url_file_name_length result to int

Pull Request - State: closed - Opened by Medno 6 months ago

#266 - Allow RssCrawler to fallback if there was an empty RSS feed

Pull Request - State: closed - Opened by Medno 6 months ago - 4 comments

#265 - Use SPIDER_MODULES field from configuration file if it's defined

Pull Request - State: closed - Opened by Medno 6 months ago - 2 comments

#264 - Use newspaper4k instead of abandoned newspaper3k

Pull Request - State: closed - Opened by t1h0 6 months ago - 3 comments

#263 - Use newspaper4k instead of abandoned newspaper3k

Pull Request - State: closed - Opened by t1h0 6 months ago - 3 comments

#261 - Add the possibility to define a schema for postgresql database

Pull Request - State: closed - Opened by yldoctrine 7 months ago - 7 comments

#260 - Unable to change URLS from example URLS

Issue - State: closed - Opened by rinaforristal 7 months ago - 2 comments

#259 - ImportError: libpq.so.5: cannot open shared object file: No such file or directory

Issue - State: closed - Opened by Pasanlaksitha 9 months ago - 1 comment

#258 - Reuter news scrip failed

Issue - State: closed - Opened by pepingreat 9 months ago - 1 comment

#257 - maintext article attribute length limitation

Issue - State: closed - Opened by zurek11 about 1 year ago - 1 comment

#256 - Make SimpleCrawler.fetch_urls() accept custom timeout

Pull Request - State: closed - Opened by StepinSilence about 1 year ago - 1 comment

#255 - Implement user agent functionality similar to newspaper3k

Issue - State: closed - Opened by GiridharRNair about 1 year ago

#253 - can not extract main text.

Issue - State: closed - Opened by simplew2011 about 1 year ago - 1 comment

#252 - fixes #172 and #169: NewsPlease.from_urls() - use multiprocessing

Pull Request - State: closed - Opened by arcolife over 1 year ago - 3 comments

#251 - Change crawler to RecursiveCrawler when used as a library and store to MongoDB

Issue - State: closed - Opened by Anhduchb01 over 1 year ago - 1 comment

#250 - Unable to Crawl and Save PDF files

Issue - State: closed - Opened by simrankaur20 over 1 year ago - 1 comment

#249 - Fix tag

Pull Request - State: closed - Opened by sharockys over 1 year ago

#248 - Sharockys docker action

Pull Request - State: closed - Opened by sharockys over 1 year ago

#247 - Newer version of ElasticSearch API changed a lot

Issue - State: closed - Opened by sharockys over 1 year ago

#246 - Adaptation for recent version of ElasticSearch Python API

Pull Request - State: closed - Opened by sharockys over 1 year ago - 1 comment

#245 - Replace cchardet with the Python 3.11.x compatible faust-cchardet

Pull Request - State: closed - Opened by eliias over 1 year ago

#244 - `NewsPlease.from_urls` updated to behave consistently with 404 urls.

Pull Request - State: closed - Opened by loganamcnichols over 1 year ago

#243 - NewsPlease.from_urls behaves inconsistently in situations where a url results in 404

Issue - State: closed - Opened by loganamcnichols over 1 year ago

#242 - Scrape by Domain

Issue - State: closed - Opened by firmai over 1 year ago - 1 comment

#241 - Error : You must `download()` an article first!

Issue - State: closed - Opened by PYogesh over 1 year ago - 2 comments

#240 - Update README.md, add descriptions for JSON

Pull Request - State: closed - Opened by Ahacad over 1 year ago

#239 - Specify more recent awscli dependency to avoid dependency resolution issues

Issue - State: closed - Opened by phoerious almost 2 years ago - 8 comments

#238 - DateFilter is never used

Issue - State: closed - Opened by namlede almost 2 years ago - 7 comments

#237 - Failed to build for python 3.11

Issue - State: closed - Opened by mattiasrubenson almost 2 years ago - 3 comments

#236 - Get only the recursive list of URLs using the Library mode

Issue - State: closed - Opened by bakrianoo about 2 years ago - 2 comments

#235 - ModuleNotFoundError: No module named 'newsplease'

Issue - State: closed - Opened by ghost about 2 years ago - 3 comments

#234 - Proxy Server configuration (HttpProxyMiddleware)

Issue - State: open - Opened by bkrishnap over 2 years ago - 4 comments
Labels: help wanted

#232 - Configure options to optimize the crawling and extraction process

Issue - State: closed - Opened by kvasilopoulos over 2 years ago

#231 - news-please at background

Issue - State: closed - Opened by noerarief23 over 2 years ago - 2 comments

#229 - Temporary failure in name resolution

Issue - State: closed - Opened by sara-02 over 2 years ago - 1 comment

#228 - Avoiding restart of commoncrawl scraping process

Issue - State: closed - Opened by joemkwon over 2 years ago

#227 - Fix Elasticsearch package version

Pull Request - State: closed - Opened by bakrianoo over 2 years ago - 1 comment

#226 - Common Crawl crawler: adapt to new data access scheme, fixes #223

Pull Request - State: closed - Opened by sebastian-nagel over 2 years ago

#225 - Execution neither possible on current Mac OS nor Windows 10

Issue - State: closed - Opened by vivianevv over 2 years ago

#224 - crawl_from_commoncrawl crashes when attempting to parse the date from warc.paths.gz

Issue - State: closed - Opened by Loumstar over 2 years ago - 1 comment

#223 - Update s3://commoncrawl/ access scheme

Issue - State: closed - Opened by sebastian-nagel over 2 years ago - 9 comments

#222 - Try migrating to fsspec

Pull Request - State: closed - Opened by sshleifer over 2 years ago

#219 - Required time by commoncrawl extractor and bug in logging

Issue - State: closed - Opened by lucadiliello about 3 years ago - 1 comment

#218 - Connection doesn't get rolled back using PostgresqlStorage Pipeline

Issue - State: closed - Opened by flatplate about 3 years ago - 1 comment

#217 - ignore_regex configuration option in config.cfg is not working properly

Issue - State: closed - Opened by marvingabler over 3 years ago - 1 comment

#216 - Refactored MySQLStorage

Pull Request - State: closed - Opened by siglun88 over 3 years ago

#215 - cannot get related maintext

Issue - State: closed - Opened by farzad-845 over 3 years ago - 1 comment

#214 - Added cchardet installation to readme

Pull Request - State: closed - Opened by lodenrogue over 3 years ago - 1 comment

#213 - Article not giving full text

Issue - State: closed - Opened by lodenrogue over 3 years ago - 3 comments

#212 - Add Github Actions build

Pull Request - State: closed - Opened by gliptak over 3 years ago - 5 comments

#211 - Error: slice indices must be integers or None or have an index method

Issue - State: closed - Opened by aljbri over 3 years ago - 1 comment

#210 - Support warc_files_end_date for common crawl crawler.

Pull Request - State: closed - Opened by shangw-nvidia over 3 years ago - 2 comments

#209 - Publish datetime timezone

Issue - State: closed - Opened by dhesru over 3 years ago - 1 comment

#208 - Bypass Paywall with credentials

Issue - State: closed - Opened by maxschaeufele over 3 years ago - 1 comment

#207 - CommonCrawl.py example

Issue - State: closed - Opened by keimiii over 3 years ago - 5 comments

#206 - Commoncrawl.py example NameError

Issue - State: closed - Opened by keimiii over 3 years ago - 1 comment

#205 - Issue with Commoncrawl.py example

Issue - State: closed - Opened by keimiii over 3 years ago - 1 comment

#204 - commoncrawl.py won't filter by host

Issue - State: closed - Opened by WilliamGough almost 4 years ago