This is very much a work in progress at the moment, and I'm just creating the PR to make it easier for me to keep track of Travis results. The goals are to:

1. Pull in changes from Mozilla's upstream code. Basically, Mozilla seems very likely to abandon their character encoding detector in the near future and switch to using ICU, but ICU doesn't support all of the codecs we currently do, because it is more web-focused. If our goal here is to be a truly universal character encoding detector, we'll need to go our own way in the future in that respect.

2. Improve PEP8 compliance all over the place. There aren't as many violations as I had initially expected, but there are some. The previous maintainers tried to keep variable names identical to the C code, presumably to ease comparison with the Mozilla code, but we're going to be diverging from upstream after pulling in the changes mentioned in 1.

3. Make the unit tests pass, or at the very least make it obvious that the tests are actually failing (instead of ignoring the failures like our current Travis build does).

The concept here is pretty simple: this tries to test for the invariant that if a string comes from valid unicode and is encoded in one of the chardet-supported encodings, then chardet should report some encoding for it (ideally the data should be decodable from the detected encoding, but given that even this test is demonstrating a bunch of bugs, this seemed like a reasonable place to start). This is (more or less) the test that caught #65, #64 and #63; #62 had one extra line in it to try to re-encode the data as the reported format. I'm pretty sure this is because of issues it's finding in the code, not issues with the test. min_size=100 is to rule out bugs that come solely from the length, prompted by your saying that short strings aren't really supported. Anecdotally, all of the bugs that have been found so far don't depend on the length, and min_size=1 would have been fine (leaving min_size alone is also valid, but I assume '' having a None encoding is intended behaviour).

Here is an overview of the content and the results:

    setup='import chardet
    html = open("mem_leak_html.txt", "rb").read()'

    python3 -m timeit -s "$setup" 'chardet.detect(html)'
    # good input produces: 10 loops, best of 3: 43 ms per loop
    # bad input produces: 1 loops, best of 3: 1min 22s per loop
    # Good input left 2.65 MB of unfreed memory.
    # Bad input left 220.16 MB of unfreed memory.

    python -m timeit -s "$setup" 'chardet.detect(html)'
    # good input produces: 10 loops, best of 3: 41.7 ms per loop
    # bad input produces: 10 loops, best of 3: 111 sec per loop
    # Good input left 3.00 MB of unfreed memory.
    # Bad input left 312.00 MB of unfreed memory.

The memory figures come from a small script along these lines:

    mem_use = lambda: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    html = open("mem_leak_html.txt", "rb").read()
    ...
    print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))
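The round-trip invariant described above can be sketched in a self-contained way. Everything below is illustrative: `naive_detect` is a hypothetical stand-in for `chardet.detect` that merely tries a fixed candidate list (not chardet's real algorithm), and the hand-picked sample strings replace hypothesis' generated inputs.

```python
# Round-trip invariant sketch: text encoded with a supported codec should be
# reported as *some* encoding, and the bytes should decode under that encoding.
# NOTE: naive_detect is a hypothetical stand-in for chardet.detect().

CANDIDATES = ["ascii", "utf-8", "latin-1"]

def naive_detect(data: bytes) -> dict:
    """Return the first candidate codec that decodes the bytes, else None."""
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return {"encoding": enc, "confidence": 1.0}
        except UnicodeDecodeError:
            continue
    return {"encoding": None, "confidence": 0.0}

def holds_round_trip(text: str, codec: str) -> bool:
    """Check the invariant for one (text, codec) pair."""
    data = text.encode(codec)
    enc = naive_detect(data)["encoding"]
    return enc is not None and isinstance(data.decode(enc), str)

# Hand-picked samples stand in for hypothesis' generated strings.
for sample in ["hello world", "caf\u00e9 au lait"]:
    for codec in CANDIDATES:
        try:
            sample.encode(codec)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the text; skip the pair
        assert holds_round_trip(sample, codec)
```

In the actual test the input text is generated by hypothesis rather than listed by hand, presumably with something like `@given(st.text(min_size=100))`, which is where the min_size discussion above comes from.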
I am using chardet as part of a web crawler written in python3. I noticed that over time (many hours), the program consumes all memory. I narrowed down the problem to a single call of the detect() method for certain web pages. After some testing, it seems that chardet has a problem with some special input, and I managed to get a sample of such an input. It consumes about 220 MB of memory on my machine (even though the input is only 2.5 MB) and takes about 1:22 minutes to process (in contrast to 43 ms when the file is truncated to about 2 MB). It seems not to be limited to python3; in python2 the memory consumption is even worse (312 MB). I cannot attach any files to this issue, so I uploaded them to my dropbox account. Please let me know of a better place to put them if necessary.
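The measurement technique behind the "unfreed memory" numbers (peak RSS via the resource module) can be reconstructed as a short helper. This is a sketch under stated assumptions: `report_unfreed` is a hypothetical name, the `/ 1024` assumes Linux, where `ru_maxrss` is reported in kilobytes (macOS reports bytes), and the dummy allocation stands in for calling `chardet.detect` on the downloaded sample file.

```python
import gc
import resource

def mem_use_mb() -> float:
    """Peak resident set size of this process in MB (ru_maxrss is KB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def report_unfreed(desc, func):
    """Run func and report how much the peak RSS grew afterwards.

    Note: ru_maxrss is a high-water mark, so this measures peak growth,
    which is how the figures in the report were obtained.
    """
    gc.collect()
    before = mem_use_mb()
    func()
    gc.collect()
    used = mem_use_mb() - before
    print('%s left %.2f MB of unfreed memory.' % (desc, used))
    return used

# Hypothetical workload standing in for chardet.detect(html): allocating
# ~32 MB of zeroed bytes pushes the peak RSS up measurably.
used = report_unfreed('dummy workload', lambda: len(bytes(32 * 1024 * 1024)))
```

Because `ru_maxrss` never decreases, the reported delta is always non-negative; a leak-free call on a previously-seen input size reports close to 0 MB.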