Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / fhamborg/news-please issues and pull requests
#288 - Adds support for exporting to OpenSearch
Pull Request -
State: open - Opened by wild5r 2 months ago
- 2 comments
#287 - Update documentation
Pull Request -
State: closed - Opened by Medno 2 months ago
- 1 comment
#286 - Correctly extract all Sitemaps urls from robots.txt
Pull Request -
State: closed - Opened by yldoctrine 4 months ago
- 1 comment
#285 - Allow the RssCrawler to search for an RSS feed from the provided url
Pull Request -
State: open - Opened by yldoctrine 4 months ago
- 8 comments
#284 - Make fetching images options for from_url and from_urls methods
Pull Request -
State: closed - Opened by TMCarrier 4 months ago
- 3 comments
#283 - Add missing dependencies to requirements.txt
Pull Request -
State: closed - Opened by jkawamoto 4 months ago
- 1 comment
#282 - Pass optional arguments to requests.get
Pull Request -
State: closed - Opened by jkawamoto 4 months ago
- 6 comments
#281 - Add check_certificate option in configuration to be able to crawl sites not having a valid certificate
Pull Request -
State: closed - Opened by yldoctrine 4 months ago
- 2 comments
#280 - Fix article_extractor to only instantiate the Extractors in the specified file
Pull Request -
State: closed - Opened by yldoctrine 4 months ago
- 1 comment
#279 - Encode empty "html_title" if it's not found
Pull Request -
State: closed - Opened by Medno 5 months ago
- 2 comments
#278 - Bugfixes
Pull Request -
State: closed - Opened by t1h0 5 months ago
#277 - Remove typing_extensions.cast
Pull Request -
State: closed - Opened by t1h0 5 months ago
- 1 comment
#276 - Raise typing-extensions and python version
Pull Request -
State: closed - Opened by t1h0 5 months ago
- 1 comment
#275 - typing_extensions missing in requirements
Issue -
State: closed - Opened by t1h0 5 months ago
- 1 comment
#274 - Add robustness for pages missing title
Pull Request -
State: closed - Opened by yldoctrine 5 months ago
#273 - Remove newspaper3k from setup
Pull Request -
State: closed - Opened by Medno 5 months ago
- 1 comment
#272 - Fix type annotations for earlier Python versions
Pull Request -
State: closed - Opened by Medno 5 months ago
- 2 comments
#271 - Add an additional sitemap check in Sitemap crawlers
Pull Request -
State: closed - Opened by Medno 5 months ago
- 3 comments
#270 - Add 4 date metadata patterns in date extraction
Pull Request -
State: closed - Opened by anteverse 5 months ago
- 2 comments
#269 - Add Redis pipeline
Pull Request -
State: closed - Opened by anteverse 6 months ago
- 5 comments
#268 - Add unique constraint on column `url` on table `CurrentVersions` in Postgres pipeline
Pull Request -
State: closed - Opened by anteverse 6 months ago
- 3 comments
#267 - Cast get_max_url_file_name_length result to int
Pull Request -
State: closed - Opened by Medno 6 months ago
#266 - Allow RssCrawler to fallback if there was an empty RSS feed
Pull Request -
State: closed - Opened by Medno 6 months ago
- 4 comments
#265 - Use SPIDER_MODULES field from configuration file if it's defined
Pull Request -
State: closed - Opened by Medno 6 months ago
- 2 comments
#264 - Use newspaper4k instead of abandoned newspaper3k
Pull Request -
State: closed - Opened by t1h0 6 months ago
- 3 comments
#263 - Use newspaper4k instead of abandoned newspaper3k
Pull Request -
State: closed - Opened by t1h0 6 months ago
- 3 comments
#261 - Add the possibility to define a schema for postgresql database
Pull Request -
State: closed - Opened by yldoctrine 7 months ago
- 7 comments
#260 - Unable to change URLS from example URLS
Issue -
State: closed - Opened by rinaforristal 7 months ago
- 2 comments
#259 - ImportError: libpq.so.5: cannot open shared object file: No such file or directory
Issue -
State: closed - Opened by Pasanlaksitha 9 months ago
- 1 comment
#258 - Reuter news scrip failed
Issue -
State: closed - Opened by pepingreat 9 months ago
- 1 comment
#257 - maintext article attribute length limitation
Issue -
State: closed - Opened by zurek11 about 1 year ago
- 1 comment
#256 - Make SimpleCrawer.fetch_urls() accept custom timeout
Pull Request -
State: closed - Opened by StepinSilence about 1 year ago
- 1 comment
#255 - Implement user agent functionality similar to News Paper 3k
Issue -
State: closed - Opened by GiridharRNair about 1 year ago
#253 - can not extract main text.
Issue -
State: closed - Opened by simplew2011 about 1 year ago
- 1 comment
#252 - fixes #172 and #169: NewsPlease.from_urls() - use multiprocessing
Pull Request -
State: closed - Opened by arcolife over 1 year ago
- 3 comments
#251 - Change Crawlers to RecursiveCrawler with as a library and store to Mongodb
Issue -
State: closed - Opened by Anhduchb01 over 1 year ago
- 1 comment
#250 - Unable to Crawl and Save PDF files
Issue -
State: closed - Opened by simrankaur20 over 1 year ago
- 1 comment
#249 - Fix tag
Pull Request -
State: closed - Opened by sharockys over 1 year ago
#248 - Sharockys docker action
Pull Request -
State: closed - Opened by sharockys over 1 year ago
#247 - Newer version of ElasticSearch API changed a lot
Issue -
State: closed - Opened by sharockys over 1 year ago
#246 - Adaptation for recent version of ElasticSearch Python API
Pull Request -
State: closed - Opened by sharockys over 1 year ago
- 1 comment
#245 - Replace cchardet with the Python 3.11.x compatible faust-cchardet
Pull Request -
State: closed - Opened by eliias over 1 year ago
#244 - `NewsPlease.from_urls` updated to behave consistently with 404 urls.
Pull Request -
State: closed - Opened by loganamcnichols over 1 year ago
#243 - NewsPlease.from_urls behaves inconsistently in situations where a url results in 404
Issue -
State: closed - Opened by loganamcnichols over 1 year ago
#242 - Scrape by Domain
Issue -
State: closed - Opened by firmai over 1 year ago
- 1 comment
#241 - Error : You must `download()` an article first!
Issue -
State: closed - Opened by PYogesh over 1 year ago
- 2 comments
#240 - Update README.md, add descriptions for JSON
Pull Request -
State: closed - Opened by Ahacad over 1 year ago
#239 - Specify more recent awscli dependency to avoid dependency resolution issues
Issue -
State: closed - Opened by phoerious almost 2 years ago
- 8 comments
#238 - DateFilter is never used
Issue -
State: closed - Opened by namlede almost 2 years ago
- 7 comments
#237 - Failed to build for python 3.11
Issue -
State: closed - Opened by mattiasrubenson almost 2 years ago
- 3 comments
#236 - Get only the recursive list of URLs using the Library mode
Issue -
State: closed - Opened by bakrianoo about 2 years ago
- 2 comments
#235 - ModuleNotFoundError: No module named 'newsplease'
Issue -
State: closed - Opened by ghost about 2 years ago
- 3 comments
#234 - Proxy Server configuration (HttpProxyMiddleware)
Issue -
State: open - Opened by bkrishnap over 2 years ago
- 4 comments
Labels: help wanted
#232 - Configure options to optimize the crawling and extraction process
Issue -
State: closed - Opened by kvasilopoulos over 2 years ago
#231 - news-please at background
Issue -
State: closed - Opened by noerarief23 over 2 years ago
- 2 comments
#229 - Temporary failure in name resolution
Issue -
State: closed - Opened by sara-02 over 2 years ago
- 1 comment
#228 - Avoiding restart of commoncrawl scraping process
Issue -
State: closed - Opened by joemkwon over 2 years ago
#227 - Fix Elasticsearch package version
Pull Request -
State: closed - Opened by bakrianoo over 2 years ago
- 1 comment
#226 - Common Crawl crawler: adapt to new data access scheme, fixes #223
Pull Request -
State: closed - Opened by sebastian-nagel over 2 years ago
#225 - Execution neither possible on current Mac OS nor Windows 10
Issue -
State: closed - Opened by vivianevv over 2 years ago
#224 - crawl_from_commoncrawl crashes when attempting to parse the date from warc.paths.gz
Issue -
State: closed - Opened by Loumstar over 2 years ago
- 1 comment
#223 - Update s3://commoncrawl/ access scheme
Issue -
State: closed - Opened by sebastian-nagel over 2 years ago
- 9 comments
#222 - try migrate fsspec
Pull Request -
State: closed - Opened by sshleifer over 2 years ago
#219 - Required time by commoncrawl extractor and bug in logging
Issue -
State: closed - Opened by lucadiliello about 3 years ago
- 1 comment
#218 - Connection doesn't get rollbacked using PostgresqlStorage Pipeline
Issue -
State: closed - Opened by flatplate about 3 years ago
- 1 comment
#217 - ignore_regex configuration option in config.cfg is not working properly
Issue -
State: closed - Opened by marvingabler over 3 years ago
- 1 comment
#216 - Refactored MySQLStorage
Pull Request -
State: closed - Opened by siglun88 over 3 years ago
#215 - cannot get related maintext
Issue -
State: closed - Opened by farzad-845 over 3 years ago
- 1 comment
#214 - Added cchardet installation to readme
Pull Request -
State: closed - Opened by lodenrogue over 3 years ago
- 1 comment
#213 - Article not giving full text
Issue -
State: closed - Opened by lodenrogue over 3 years ago
- 3 comments
#212 - Add Github Actions build
Pull Request -
State: closed - Opened by gliptak over 3 years ago
- 5 comments
#211 - Error: slice indices must be integers or None or have an __index__ method
Issue -
State: closed - Opened by aljbri over 3 years ago
- 1 comment
#210 - Support warc_files_end_date for common crawl crawler.
Pull Request -
State: closed - Opened by shangw-nvidia over 3 years ago
- 2 comments
#209 - Publish datetime timezone
Issue -
State: closed - Opened by dhesru over 3 years ago
- 1 comment
#208 - Bypass Paywall with credentials
Issue -
State: closed - Opened by maxschaeufele over 3 years ago
- 1 comment
#207 - CommonCrawl.py example
Issue -
State: closed - Opened by keimiii over 3 years ago
- 5 comments
#206 - Commoncrawl.py example NameError
Issue -
State: closed - Opened by keimiii over 3 years ago
- 1 comment
#205 - Issue with Commoncrawl.py example
Issue -
State: closed - Opened by keimiii over 3 years ago
- 1 comment
#204 - commoncrawl.py won't filter by host
Issue -
State: closed - Opened by WilliamGough almost 4 years ago
#203 - Add option of not fetching images using newspaper library and make default for commoncrawl
Pull Request -
State: closed - Opened by frankier almost 4 years ago
- 1 comment
#202 - Handle publish_date == "" in newspaper_extractor
Pull Request -
State: closed - Opened by frankier almost 4 years ago
#201 - Filter empty responses from WARC to avoid spurious exceptions from `newspaper`
Pull Request -
State: closed - Opened by frankier almost 4 years ago
#200 - Check for when log_pathname_fully_extracted_warcs is None and don't log in this case
Pull Request -
State: closed - Opened by frankier almost 4 years ago
#199 - * newsplease/crawler/commoncrawl_crawler.py: use a system temp file
Pull Request -
State: closed - Opened by lgov almost 4 years ago
- 1 comment
#198 - Add option of replacing unicode decode errors in WARC/common crawl extraction
Pull Request -
State: closed - Opened by frankier almost 4 years ago
- 4 comments
#197 - Continue detection when LangDetectException in <article>
Pull Request -
State: closed - Opened by frankier almost 4 years ago
#196 - Fix longest article detection langdetect
Pull Request -
State: closed - Opened by frankier almost 4 years ago
#195 - Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first
Pull Request -
State: closed - Opened by frankier almost 4 years ago
- 1 comment
#194 - Fallback to bytes fromstring when lxml unicode fromstring fails
Pull Request -
State: closed - Opened by frankier almost 4 years ago
#193 - Make filter_record public to enable subclasses to override
Pull Request -
State: closed - Opened by frankier almost 4 years ago
- 2 comments
#192 - Update the root URL
Issue -
State: closed - Opened by ghost almost 4 years ago
- 2 comments
#191 - Fixes #189: cchardet installation
Pull Request -
State: closed - Opened by shradhasehgal almost 4 years ago
#189 - May be this is a typo: "import cchardet" instead of "import chardet"
Issue -
State: closed - Opened by parrondo almost 4 years ago
- 2 comments
#188 - Commoncrawl crawler: use system temp file in a writable folder
Pull Request -
State: closed - Opened by lgov almost 4 years ago
- 4 comments
#187 - Adding Postgresql pipeline in config.cfg gives error "psycopg2.ProgrammingError: no results to fetch error" when running crawler
Issue -
State: closed - Opened by ghost almost 4 years ago
- 2 comments
#185 - When crawling whole sites, is there a way to start crawling the latest news rather than old ones?
Issue -
State: closed - Opened by justlike-prog almost 4 years ago
- 2 comments
#184 - Filter commoncrawl warc files after exact timestamp, not only per year+month
Pull Request -
State: closed - Opened by lgov about 4 years ago
#183 - Add support for Elasticsearch API Key Authentication
Issue -
State: closed - Opened by roberto-naharro about 4 years ago
- 1 comment
#182 - awscli should be an optional dependency
Issue -
State: closed - Opened by rpocase about 4 years ago
- 1 comment
#181 - Tags keyword can't crawled
Issue -
State: closed - Opened by jugosx about 4 years ago
- 1 comment