Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / adbar/trafilatura issues and pull requests

#330 - Proxy support to Trafilatura

Issue - State: closed - Opened by andremacola over 1 year ago - 4 comments
Labels: enhancement

#329 - reflect changes in courlan library

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#328 - fix: html with no metadata image

Pull Request - State: closed - Opened by andremacola over 1 year ago - 1 comment

#327 - add is_live test using HTTP HEAD request

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#326 - sitemaps: more efficient processing

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#325 - ValueError: signal only works in main thread

Issue - State: closed - Opened by pandemosth over 1 year ago - 2 comments
Labels: bug

#324 - CLI fix: single URL provided with -u

Pull Request - State: closed - Opened by adbar over 1 year ago

#323 - feat: add basic auth support

Issue - State: closed - Opened by kondounagi over 1 year ago - 4 comments
Labels: feedback

#322 - fetch_url doesn't return RawResponse and doesn't provide access to response code

Issue - State: closed - Opened by edkrueger over 1 year ago - 10 comments
Labels: enhancement, documentation

#321 - #320 - update deprecated code

Pull Request - State: closed - Opened by sdondley over 1 year ago - 3 comments

#319 - 'lxml.etree._Element' object has no attribute 'text_content'

Issue - State: closed - Opened by asjsrep over 1 year ago - 17 comments
Labels: bug, documentation

#318 - Doesnt extract li tags content with an id

Issue - State: closed - Opened by alroythalus over 1 year ago - 5 comments
Labels: bug

#317 - prepare version 1.5.0

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#316 - setup: simplify CI and remove tests for Python 3.12-dev

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#315 - Fix handling of text content of <div> with empty <p> child

Pull Request - State: closed - Opened by knit-bee over 1 year ago - 1 comment

#314 - New content hashes and default file names

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#313 - probe_alternative_homepage no_ssl arg from fetch_url

Issue - State: closed - Opened by hyshandler over 1 year ago - 1 comment
Labels: question

#312 - spider: update setup and adjust

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#311 - build(deps): bump goose3 from 3.1.12 to 3.1.13

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 1 comment
Labels: dependencies

#310 - feat: extract pagetype from og:type or ld+json

Pull Request - State: closed - Opened by andremacola over 1 year ago - 2 comments

#309 - Cannot extract heading correctly in a list

Issue - State: closed - Opened by fortyfourforty over 1 year ago - 5 comments
Labels: bug

#308 - What is the recommended approach to outputting a more readable article

Issue - State: closed - Opened by rbhalla over 1 year ago - 1 comment
Labels: question

#307 - Extract page type with og:type or jd+json

Issue - State: closed - Opened by andremacola over 1 year ago - 4 comments
Labels: enhancement

#306 - Add as_dict method to Document.

Pull Request - State: closed - Opened by edkrueger over 1 year ago - 2 comments

#305 - `cchardet` is recommended now or is replaced by `charset-normalizer` by default?

Issue - State: closed - Opened by lord-alfred over 1 year ago - 2 comments
Labels: feedback

#304 - Unable to extract full text from New Yorker article

Issue - State: closed - Opened by CyberneticTurtle over 1 year ago - 1 comment
Labels: duplicate

#303 - Prevent extract_metadata from failing then @type is an empty list.

Pull Request - State: closed - Opened by edkrueger over 1 year ago - 2 comments

#302 - Method of Sourcing Elements

Issue - State: closed - Opened by jackHedaya almost 2 years ago - 3 comments
Labels: feedback

#301 - Cannot extract table from wordpress gutenberg blocks

Issue - State: closed - Opened by fortyfourforty almost 2 years ago - 4 comments
Labels: bug

#300 - trafilatura as a server

Issue - State: closed - Opened by lord-alfred almost 2 years ago - 4 comments
Labels: duplicate, question

#299 - Empty list value for @type causes extract_metadata to fail

Issue - State: closed - Opened by edkrueger almost 2 years ago - 1 comment
Labels: bug

#298 - extract_metadata() doesn't return dict, but documentation says it does

Issue - State: closed - Opened by edkrueger almost 2 years ago - 3 comments
Labels: enhancement, documentation

#297 - Error on compare_extraction function when no fallback is False

Issue - State: closed - Opened by felipehertzer almost 2 years ago - 3 comments
Labels: question

#296 - Fixed bug on JSON metadata when ld+JSON is formatted wrong

Pull Request - State: closed - Opened by felipehertzer almost 2 years ago - 2 comments

#295 - Add new class to metadata title

Pull Request - State: closed - Opened by felipehertzer almost 2 years ago - 1 comment

#294 - Sourcery refactored master branch

Pull Request - State: closed - Opened by sourcery-ai[bot] almost 2 years ago - 1 comment

#293 - build(deps): bump trafilatura from 1.4.0 to 1.4.1

Pull Request - State: closed - Opened by dependabot[bot] almost 2 years ago - 1 comment
Labels: dependencies

#292 - build(deps): bump beautifulsoup4 from 4.11.1 to 4.11.2

Pull Request - State: closed - Opened by dependabot[bot] almost 2 years ago - 1 comment
Labels: dependencies

#291 - Option to remove unreachable pages and pages not strictly in the same domain

Issue - State: closed - Opened by MTB-nsartor almost 2 years ago - 3 comments
Labels: enhancement

#290 - Collected links as metadata field?

Issue - State: open - Opened by Amaimersion almost 2 years ago - 3 comments
Labels: enhancement

#289 - Fix XPath expression in subtree

Issue - State: open - Opened by adbar almost 2 years ago - 1 comment
Labels: maintenance

#288 - Can't get include_images to include any images

Issue - State: closed - Opened by boxabirds almost 2 years ago - 7 comments
Labels: question

#286 - build(deps): bump inscriptis from 2.3.1 to 2.3.2

Pull Request - State: closed - Opened by dependabot[bot] almost 2 years ago - 3 comments
Labels: dependencies

#284 - Improve title extraction by removing sitename suffix

Issue - State: closed - Opened by andremacola almost 2 years ago - 6 comments
Labels: enhancement

#283 - Remove unwanted html elements with regex or xpaths

Issue - State: closed - Opened by andremacola almost 2 years ago - 8 comments
Labels: question

#282 - feat: Add image urls to metadata

Pull Request - State: closed - Opened by andremacola almost 2 years ago - 12 comments

#281 - Add image urls to metadata

Issue - State: closed - Opened by andremacola almost 2 years ago - 1 comment
Labels: enhancement

#280 - setup: use faust-cchardet from 3.10 onwards

Pull Request - State: closed - Opened by adbar almost 2 years ago

#279 - fix setup (2)

Pull Request - State: closed - Opened by adbar almost 2 years ago

#278 - setup: try to fix actions

Pull Request - State: closed - Opened by adbar almost 2 years ago

#277 - Fix setup for oldest and newest Python versions

Pull Request - State: closed - Opened by adbar almost 2 years ago - 1 comment

#276 - improved cli and gui

Pull Request - State: closed - Opened by wu-seong almost 2 years ago - 1 comment

#275 - Fix for failing tests

Pull Request - State: closed - Opened by knit-bee almost 2 years ago

#274 - TEI: Nesting of <ab> elements

Pull Request - State: closed - Opened by knit-bee almost 2 years ago

#273 - Remove double tags in XML output

Pull Request - State: closed - Opened by knit-bee almost 2 years ago - 1 comment

#272 - Extraction of Youtube iframes and img elements with links

Issue - State: open - Opened by sampathmende almost 2 years ago - 3 comments
Labels: enhancement

#271 - PytzUsageWarning: localize method no longer necessary

Issue - State: closed - Opened by rwinterschlaf almost 2 years ago - 2 comments
Labels: question

#270 - 403 for URL for Amazon

Issue - State: closed - Opened by mirfan899 almost 2 years ago - 1 comment
Labels: question

#269 - Fixes to Emoji Regexp

Pull Request - State: closed - Opened by felipehertzer about 2 years ago - 1 comment

#268 - Html extraction

Issue - State: closed - Opened by slavaGanzin about 2 years ago - 3 comments
Labels: question

#267 - author regexes: review ranges (#266)

Pull Request - State: closed - Opened by adbar about 2 years ago - 1 comment

#266 - Fix code scanning alert - Overly permissive regular expression range

Issue - State: closed - Opened by adbar about 2 years ago - 1 comment

#263 - Endless reading for link / timeout possible?

Issue - State: closed - Opened by Rapid1898-code about 2 years ago - 7 comments
Labels: enhancement

#261 - CLI arguments inconsistent: --inputfile and --inputdir

Issue - State: closed - Opened by adbar about 2 years ago - 1 comment
Labels: good first issue, up for grabs

#259 - Added the possibility to prune custom path's

Pull Request - State: closed - Opened by HeLehm about 2 years ago - 2 comments

#254 - TEI conformity: improve divs and element tails, fw → ab

Pull Request - State: closed - Opened by knit-bee about 2 years ago - 7 comments

#253 - TEI: Handle invalid siblings of <div>

Pull Request - State: closed - Opened by knit-bee about 2 years ago - 4 comments

#232 - Defer URL management to courlan.UrlStore (experimental)

Pull Request - State: closed - Opened by adbar over 2 years ago - 4 comments

#231 - Trafilatura appears to ignore <meta charset="...">

Issue - State: closed - Opened by zackw over 2 years ago - 3 comments
Labels: question, feedback

#229 - Keep orderedness information of lists

Issue - State: closed - Opened by DavidNemeskey over 2 years ago - 4 comments
Labels: feedback

#225 - Add argument to use archive.org as a backup in fetch_url()

Issue - State: closed - Opened by vprelovac over 2 years ago - 12 comments
Labels: enhancement, documentation

#224 - Add document language to metadata

Issue - State: open - Opened by adbar over 2 years ago - 6 comments
Labels: enhancement

#216 - Memory leaks

Issue - State: closed - Opened by kinoute over 2 years ago - 6 comments
Labels: bug

#215 - Question: JSON for Linking Data

Issue - State: closed - Opened by Lucabenj over 2 years ago - 6 comments
Labels: question, feedback

#202 - Celery error with v1.2.1: ValueError: signal only works in main thread

Issue - State: closed - Opened by alex-bender over 2 years ago - 17 comments
Labels: feedback

#197 - Added Coinbase article annotation

Pull Request - State: closed - Opened by swetepete over 2 years ago - 3 comments

#195 - Extend test coverage for json_metadata functions

Issue - State: closed - Opened by adbar over 2 years ago - 5 comments
Labels: feedback

#175 - Add include_video parameter (iframe elements are missing)

Issue - State: open - Opened by fraseInc over 2 years ago - 9 comments
Labels: enhancement

#166 - Issue with LXML on M1 / Apple arm64 platforms

Issue - State: closed - Opened by naftalibeder almost 3 years ago - 9 comments
Labels: bug, wontfix, documentation

#158 - xml extraction leads to <graphic> tags in the wrong place.

Issue - State: closed - Opened by joschu almost 3 years ago - 5 comments
Labels: bug

#151 - CLI: run as server

Issue - State: closed - Opened by adbar almost 3 years ago - 4 comments
Labels: enhancement

#148 - Interaction with internet archives (API and formats)

Issue - State: closed - Opened by adbar almost 3 years ago - 2 comments
Labels: enhancement

#147 - anchor issue

Issue - State: closed - Opened by pieterhartel almost 3 years ago - 5 comments
Labels: bug, wontfix

#122 - CLI: improve usability for large number of downloads

Issue - State: closed - Opened by adbar about 3 years ago
Labels: enhancement

#116 - Investigate accuracy on Polish and Russian websites?

Issue - State: closed - Opened by adbar about 3 years ago - 1 comment
Labels: question

#113 - Unexpected lack of whitespace before/after ref tags in XML output

Issue - State: closed - Opened by adri1wald about 3 years ago - 4 comments
Labels: bug

#105 - Extract content from formats other than HTML: PDF, EPUB?

Issue - State: closed - Opened by adbar over 3 years ago - 9 comments
Labels: enhancement, feedback

#99 - Parse JSON-LD information and write heuristics to decide where to draw info from

Issue - State: closed - Opened by adbar over 3 years ago - 3 comments
Labels: enhancement

#89 - Graphic tag with no src attribute in XML result

Issue - State: closed - Opened by phongtnit over 3 years ago - 2 comments
Labels: bug

#80 - Teaser with link in article flow

Issue - State: closed - Opened by adbar over 3 years ago - 1 comment
Labels: enhancement

#57 - Is there a way to extract a top image from an article?

Issue - State: closed - Opened by ArturasDruteika over 3 years ago - 12 comments
Labels: bug

#53 - Bypass catchas/cookies/consent windows?

Issue - State: closed - Opened by adbar almost 4 years ago - 3 comments
Labels: feedback

#37 - Investigate potential speed-up with customized readability-lxml

Issue - State: closed - Opened by adbar almost 4 years ago - 1 comment
Labels: enhancement

#4 - List of smaller extraction bugs (text & metadata)

Issue - State: open - Opened by adbar almost 5 years ago - 30 comments
Labels: good first issue, up for grabs

#3 - Thoroughly implement and test duplicate detection

Issue - State: closed - Opened by adbar almost 5 years ago - 2 comments
Labels: enhancement