Ecosyste.ms: Issues

An open API service that provides issue and pull request metadata for open source projects.

GitHub / fhamborg/news-please issues and pull requests

#288 - Adds support for exporting to OpenSearch

Pull Request - State: open - Opened by wild5r 2 months ago - 2 comments

#287 - Update documentation

Pull Request - State: closed - Opened by Medno 2 months ago - 1 comment

#286 - Correctly extract all Sitemaps urls from robots.txt

Pull Request - State: closed - Opened by yldoctrine 4 months ago - 1 comment

#285 - Allow the RssCrawler to search for an RSS feed from the provided url

Pull Request - State: open - Opened by yldoctrine 4 months ago - 8 comments

#284 - Make fetching images options for from_url and from_urls methods

Pull Request - State: closed - Opened by TMCarrier 4 months ago - 3 comments

#283 - Add missing dependencies to requirements.txt

Pull Request - State: closed - Opened by jkawamoto 4 months ago - 1 comment

#282 - Pass optional arguments to requests.get

Pull Request - State: closed - Opened by jkawamoto 4 months ago - 6 comments

#280 - Fix article_extractor to only instantiate the Extractors in the specified file

Pull Request - State: closed - Opened by yldoctrine 4 months ago - 1 comment

#279 - Encode empty "html_title" if it's not found

Pull Request - State: closed - Opened by Medno 5 months ago - 2 comments

#278 - Bugfixes

Pull Request - State: closed - Opened by t1h0 5 months ago

#277 - Remove typing_extensions.cast

Pull Request - State: closed - Opened by t1h0 5 months ago - 1 comment

#276 - Raise typing-extensions and python version

Pull Request - State: closed - Opened by t1h0 5 months ago - 1 comment

#275 - typing_extensions missing in requirements

Issue - State: closed - Opened by t1h0 5 months ago - 1 comment

#274 - Add robustness for pages missing title

Pull Request - State: closed - Opened by yldoctrine 5 months ago

#273 - Remove newspaper3k from setup

Pull Request - State: closed - Opened by Medno 5 months ago - 1 comment

#272 - Fix type annotations for earlier Python versions

Pull Request - State: closed - Opened by Medno 5 months ago - 2 comments

#271 - Add an additional sitemap check in Sitemap crawlers

Pull Request - State: closed - Opened by Medno 5 months ago - 3 comments

#270 - Add 4 date metadata patterns in date extraction

Pull Request - State: closed - Opened by anteverse 5 months ago - 2 comments

#269 - Add Redis pipeline

Pull Request - State: closed - Opened by anteverse 6 months ago - 5 comments

#267 - Cast get_max_url_file_name_length result to int

Pull Request - State: closed - Opened by Medno 6 months ago

#266 - Allow RssCrawler to fallback if there was an empty RSS feed

Pull Request - State: closed - Opened by Medno 6 months ago - 4 comments

#265 - Use SPIDER_MODULES field from configuration file if it's defined

Pull Request - State: closed - Opened by Medno 6 months ago - 2 comments

#264 - Use newspaper4k instead of abandoned newspaper3k

Pull Request - State: closed - Opened by t1h0 6 months ago - 3 comments

#263 - Use newspaper4k instead of abandoned newspaper3k

Pull Request - State: closed - Opened by t1h0 6 months ago - 3 comments

#261 - Add the possibility to define a schema for postgresql database

Pull Request - State: closed - Opened by yldoctrine 7 months ago - 7 comments

#260 - Unable to change URLS from example URLS

Issue - State: closed - Opened by rinaforristal 7 months ago - 2 comments

#258 - Reuter news scrip failed

Issue - State: closed - Opened by pepingreat 9 months ago - 1 comment

#257 - maintext article attribute length limitation

Issue - State: closed - Opened by zurek11 about 1 year ago - 1 comment

#256 - Make SimpleCrawer.fetch_urls() accept custom timeout

Pull Request - State: closed - Opened by StepinSilence about 1 year ago - 1 comment

#253 - can not extract main text.

Issue - State: closed - Opened by simplew2011 about 1 year ago - 1 comment

#252 - fixes #172 and #169: NewsPlease.from_urls() - use multiprocessing

Pull Request - State: closed - Opened by arcolife over 1 year ago - 3 comments

#251 - Change Crawlers to RecursiveCrawler with as a library and store to Mongodb

Issue - State: closed - Opened by Anhduchb01 over 1 year ago - 1 comment

#250 - Unable to Crawl and Save PDF files

Issue - State: closed - Opened by simrankaur20 over 1 year ago - 1 comment

#249 - Fix tag

Pull Request - State: closed - Opened by sharockys over 1 year ago

#248 - Sharockys docker action

Pull Request - State: closed - Opened by sharockys over 1 year ago

#247 - Newer version of ElasticSearch API changed a lot

Issue - State: closed - Opened by sharockys over 1 year ago

#246 - Adaptation for recent version of ElasticSearch Python API

Pull Request - State: closed - Opened by sharockys over 1 year ago - 1 comment

#245 - Replace cchardet with the Python 3.11.x compatible faust-cchardet

Pull Request - State: closed - Opened by eliias over 1 year ago

#242 - Scrape by Domain

Issue - State: closed - Opened by firmai over 1 year ago - 1 comment

#241 - Error : You must `download()` an article first!

Issue - State: closed - Opened by PYogesh over 1 year ago - 2 comments

#240 - Update README.md, add descriptions for JSON

Pull Request - State: closed - Opened by Ahacad over 1 year ago

#239 - Specify more recent awscli dependency to avoid dependency resolution issues

Issue - State: closed - Opened by phoerious almost 2 years ago - 8 comments

#238 - DateFilter is never used

Issue - State: closed - Opened by namlede almost 2 years ago - 7 comments

#237 - Failed to build for python 3.11

Issue - State: closed - Opened by mattiasrubenson almost 2 years ago - 3 comments

#236 - Get only the recursive list of URLs using the Library mode

Issue - State: closed - Opened by bakrianoo about 2 years ago - 2 comments

#235 - ModuleNotFoundError: No module named 'newsplease'

Issue - State: closed - Opened by ghost about 2 years ago - 3 comments

#234 - Proxy Server configuration (HttpProxyMiddleware)

Issue - State: open - Opened by bkrishnap over 2 years ago - 4 comments
Labels: help wanted

#231 - news-please at background

Issue - State: closed - Opened by noerarief23 over 2 years ago - 2 comments

#229 - Temporary failure in name resolution

Issue - State: closed - Opened by sara-02 over 2 years ago - 1 comment

#228 - Avoiding restart of commoncrawl scraping process

Issue - State: closed - Opened by joemkwon over 2 years ago

#227 - Fix Elasticsearch package version

Pull Request - State: closed - Opened by bakrianoo over 2 years ago - 1 comment

#225 - Execution neither possible on current Mac OS nor Windows 10

Issue - State: closed - Opened by vivianevv over 2 years ago

#223 - Update s3://commoncrawl/ access scheme

Issue - State: closed - Opened by sebastian-nagel over 2 years ago - 9 comments

#222 - try migrate fsspec

Pull Request - State: closed - Opened by sshleifer over 2 years ago

#219 - Required time by commoncrawl extractor and bug in logging

Issue - State: closed - Opened by lucadiliello about 3 years ago - 1 comment

#218 - Connection doesn't get rollbacked using PostgresqlStorage Pipeline

Issue - State: closed - Opened by flatplate about 3 years ago - 1 comment

#217 - ignore_regex configuration option in config.cfg is not working properly

Issue - State: closed - Opened by marvingabler over 3 years ago - 1 comment

#216 - Refactored MySQLStorage

Pull Request - State: closed - Opened by siglun88 over 3 years ago

#215 - cannot get related maintext

Issue - State: closed - Opened by farzad-845 over 3 years ago - 1 comment

#214 - Added cchardet installation to readme

Pull Request - State: closed - Opened by lodenrogue over 3 years ago - 1 comment

#213 - Article not giving full text

Issue - State: closed - Opened by lodenrogue over 3 years ago - 3 comments

#212 - Add Github Actions build

Pull Request - State: closed - Opened by gliptak over 3 years ago - 5 comments

#211 - Error: slice indices must be integers or None or have an __index__ method

Issue - State: closed - Opened by aljbri over 3 years ago - 1 comment

#210 - Support warc_files_end_date for common crawl crawler.

Pull Request - State: closed - Opened by shangw-nvidia over 3 years ago - 2 comments

#209 - Publish datetime timezone

Issue - State: closed - Opened by dhesru over 3 years ago - 1 comment

#208 - Bypass Paywall with credentials

Issue - State: closed - Opened by maxschaeufele over 3 years ago - 1 comment

#207 - CommonCrawl.py example

Issue - State: closed - Opened by keimiii over 3 years ago - 5 comments

#206 - Commoncrawl.py example NameError

Issue - State: closed - Opened by keimiii over 3 years ago - 1 comment

#205 - Issue with Commoncrawl.py example

Issue - State: closed - Opened by keimiii over 3 years ago - 1 comment

#204 - commoncrawl.py won't filter by host

Issue - State: closed - Opened by WilliamGough almost 4 years ago

#203 - Add option of not fetching images using newspaper library and make default for commoncrawl

Pull Request - State: closed - Opened by frankier almost 4 years ago - 1 comment

#202 - Handle publish_date == "" in newspaper_extractor

Pull Request - State: closed - Opened by frankier almost 4 years ago

#199 - * newsplease/crawler/commoncrawl_crawler.py: use a system temp file

Pull Request - State: closed - Opened by lgov almost 4 years ago - 1 comment

#198 - Add option of replacing unicode decode errors in WARC/common crawl extraction

Pull Request - State: closed - Opened by frankier almost 4 years ago - 4 comments

#197 - Continue detection when LangDetectException in <article>

Pull Request - State: closed - Opened by frankier almost 4 years ago

#196 - Fix longest article detection langdetect

Pull Request - State: closed - Opened by frankier almost 4 years ago

#194 - Fallback to bytes fromstring when lxml unicode fromstring fails

Pull Request - State: closed - Opened by frankier almost 4 years ago

#193 - Make filter_record public to enable subclasses to override

Pull Request - State: closed - Opened by frankier almost 4 years ago - 2 comments

#192 - Update the root URL

Issue - State: closed - Opened by ghost almost 4 years ago - 2 comments

#191 - Fixes #189: cchardet installation

Pull Request - State: closed - Opened by shradhasehgal almost 4 years ago

#189 - May be this is a typo: "import cchardet" instead of "import chardet"

Issue - State: closed - Opened by parrondo almost 4 years ago - 2 comments

#188 - Commoncrawl crawler: use system temp file in a writable folder

Pull Request - State: closed - Opened by lgov almost 4 years ago - 4 comments

#184 - Filter commoncrawl warc files after exact timestamp, not only per year+month

Pull Request - State: closed - Opened by lgov almost 4 years ago

#183 - Add support for Elasticsearch API Key Authentication

Issue - State: closed - Opened by roberto-naharro about 4 years ago - 1 comment

#182 - awscli should be an optional dependency

Issue - State: closed - Opened by rpocase about 4 years ago - 1 comment

#181 - Tags keyword can't crawled

Issue - State: closed - Opened by jugosx about 4 years ago - 1 comment