PNG  IHDRX cHRMz&u0`:pQ<bKGD pHYsodtIME MeqIDATxw]Wug^Qd˶ 6`!N:!@xI~)%7%@Bh&`lnjVF29gΨ4E$|>cɚ{gk= %,a KX%,a KX%,a KX%,a KX%,a KX%,a KX%, b` ǟzeאfp]<!SJmɤY޲ڿ,%c ~ع9VH.!Ͳz&QynֺTkRR.BLHi٪:l;@(!MԴ=žI,:o&N'Kù\vRmJ雵֫AWic H@" !: Cé||]k-Ha oݜ:y F())u]aG7*JV@J415p=sZH!=!DRʯvɱh~V\}v/GKY$n]"X"}t@ xS76^[bw4dsce)2dU0 CkMa-U5tvLƀ~mlMwfGE/-]7XAƟ`׮g ewxwC4\[~7@O-Q( a*XGƒ{ ՟}$_y3tĐƤatgvێi|K=uVyrŲlLӪuܿzwk$m87k( `múcE)"@rK( z4$D; 2kW=Xb$V[Ru819קR~qloѱDyįݎ*mxw]y5e4K@ЃI0A D@"BDk_)N\8͜9dz"fK0zɿvM /.:2O{ Nb=M=7>??Zuo32 DLD@D| &+֎C #B8ַ`bOb $D#ͮҪtx]%`ES`Ru[=¾!@Od37LJ0!OIR4m]GZRJu$‡c=%~s@6SKy?CeIh:[vR@Lh | (BhAMy=݃  G"'wzn޺~8ԽSh ~T*A:xR[ܹ?X[uKL_=fDȊ؂p0}7=D$Ekq!/t.*2ʼnDbŞ}DijYaȲ(""6HA;:LzxQ‘(SQQ}*PL*fc\s `/d'QXW, e`#kPGZuŞuO{{wm[&NBTiiI0bukcA9<4@SӊH*؎4U/'2U5.(9JuDfrޱtycU%j(:RUbArLֺN)udA':uGQN"-"Is.*+k@ `Ojs@yU/ H:l;@yyTn}_yw!VkRJ4P)~y#)r,D =ě"Q]ci'%HI4ZL0"MJy 8A{ aN<8D"1#IJi >XjX֔#@>-{vN!8tRݻ^)N_╗FJEk]CT՟ YP:_|H1@ CBk]yKYp|og?*dGvzنzӴzjֺNkC~AbZƷ`.H)=!QͷVTT(| u78y֮}|[8-Vjp%2JPk[}ԉaH8Wpqhwr:vWª<}l77_~{s۴V+RCģ%WRZ\AqHifɤL36: #F:p]Bq/z{0CU6ݳEv_^k7'>sq*+kH%a`0ԣisqにtү04gVgW΂iJiS'3w.w}l6MC2uԯ|>JF5`fV5m`Y**Db1FKNttu]4ccsQNnex/87+}xaUW9y>ͯ骵G{䩓Գ3+vU}~jJ.NFRD7<aJDB1#ҳgSb,+CS?/ VG J?|?,2#M9}B)MiE+G`-wo߫V`fio(}S^4e~V4bHOYb"b#E)dda:'?}׮4繏`{7Z"uny-?ǹ;0MKx{:_pÚmFמ:F " .LFQLG)Q8qN q¯¯3wOvxDb\. BKD9_NN &L:4D{mm o^tֽ:q!ƥ}K+<"m78N< ywsard5+вz~mnG)=}lYݧNj'QJS{S :UYS-952?&O-:W}(!6Mk4+>A>j+i|<<|;ر^߉=HE|V#F)Emm#}/"y GII웻Jі94+v뾧xu~5C95~ūH>c@덉pʃ1/4-A2G%7>m;–Y,cyyaln" ?ƻ!ʪ<{~h~i y.zZB̃/,雋SiC/JFMmBH&&FAbϓO^tubbb_hZ{_QZ-sύodFgO(6]TJA˯#`۶ɟ( %$&+V'~hiYy>922 Wp74Zkq+Ovn錄c>8~GqܲcWꂎz@"1A.}T)uiW4="jJ2W7mU/N0gcqܗOO}?9/wìXžΏ0 >֩(V^Rh32!Hj5`;O28؇2#ݕf3 ?sJd8NJ@7O0 b־?lldщ̡&|9C.8RTWwxWy46ah嘦mh٤&l zCy!PY?: CJyв]dm4ǜҐR޻RլhX{FƯanшQI@x' ao(kUUuxW_Ñ줮[w8 FRJ(8˼)_mQ _!RJhm=!cVmm ?sFOnll6Qk}alY}; "baӌ~M0w,Ggw2W:G/k2%R,_=u`WU R.9T"v,<\Ik޽/2110Ӿxc0gyC&Ny޽JҢrV6N ``یeA16"J³+Rj*;BϜkZPJaÍ<Jyw:NP8/D$ 011z֊Ⱳ3ι֘k1V_"h!JPIΣ'ɜ* aEAd:ݺ>y<}Lp&PlRfTb1]o .2EW\ͮ]38؋rTJsǏP@芎sF\> P^+dYJLbJ C-xϐn> ι$nj,;Ǖa FU *择|h ~izť3ᤓ`K'-f tL7JK+vf2)V'-sFuB4i+m+@My=O҈0"|Yxoj,3]:cо3 $#uŘ%Y"y죯LebqtҢVzq¼X)~>4L׶m~[1_k?kxֺQ`\ |ٛY4Ѯr!)N9{56(iNq}O()Em]=F&u?$HypWUeB\k]JɩSع9 Zqg4ZĊo oMcjZBU]B\TUd34ݝ~:7ڶSUsB0Z3srx 7`:5xcx !qZA!;%͚7&P H<WL!džOb5kF)xor^aujƍ7 Ǡ8/p^(L>ὴ-B,{ۇWzֺ^k]3\EE@7>lYBȝR.oHnXO/}sB|.i@ɥDB4tcm,@ӣgdtJ!lH$_vN166L__'Z)y&kH;:,Y7=J 9cG) V\hjiE;gya~%ks_nC~Er er)muuMg2;֫R)Md) ,¶ 2-wr#F7<-BBn~_(o=KO㭇[Xv eN_SMgSҐ BS헃D%g_N:/pe -wkG*9yYSZS.9cREL !k}<4_Xs#FmҶ:7R$i,fi!~' # !6/S6y@kZkZcX)%5V4P]VGYq%H1!;e1MV<!ϐHO021Dp= HMs~~a)ަu7G^];git!Frl]H/L$=AeUvZE4P\.,xi {-~p?2b#amXAHq)MWǾI_r`S Hz&|{ +ʖ_= (YS(_g0a03M`I&'9vl?MM+m~}*xT۲(fY*V4x@29s{DaY"toGNTO+xCAO~4Ϳ;p`Ѫ:>Ҵ7K 3}+0 387x\)a"/E>qpWB=1 ¨"MP(\xp߫́A3+J] n[ʼnӼaTbZUWb={~2ooKױӰp(CS\S筐R*JغV&&"FA}J>G֐p1ٸbk7 ŘH$JoN <8s^yk_[;gy-;߉DV{c B yce% aJhDȶ 2IdйIB/^n0tNtџdcKj4϶v~- CBcgqx9= PJ) dMsjpYB] GD4RDWX +h{y`,3ꊕ$`zj*N^TP4L:Iz9~6s) Ga:?y*J~?OrMwP\](21sZUD ?ܟQ5Q%ggW6QdO+\@ ̪X'GxN @'4=ˋ+*VwN ne_|(/BDfj5(Dq<*tNt1х!MV.C0 32b#?n0pzj#!38}޴o1KovCJ`8ŗ_"]] rDUy޲@ Ȗ-;xџ'^Y`zEd?0„ DAL18IS]VGq\4o !swV7ˣι%4FѮ~}6)OgS[~Q vcYbL!wG3 7띸*E Pql8=jT\꘿I(z<[6OrR8ºC~ډ]=rNl[g|v TMTղb-o}OrP^Q]<98S¤!k)G(Vkwyqyr޽Nv`N/e p/~NAOk \I:G6]4+K;j$R:Mi #*[AȚT,ʰ,;N{HZTGMoּy) ]%dHء9Պ䠬|<45,\=[bƟ8QXeB3- &dҩ^{>/86bXmZ]]yޚN[(WAHL$YAgDKp=5GHjU&99v簪C0vygln*P)9^͞}lMuiH!̍#DoRBn9l@ xA/_v=ȺT{7Yt2N"4!YN`ae >Q<XMydEB`VU}u]嫇.%e^ánE87Mu\t`cP=AD/G)sI"@MP;)]%fH9'FNsj1pVhY&9=0pfuJ&gޤx+k:!r˭wkl03׼Ku C &ѓYt{.O.zҏ z}/tf_wEp2gvX)GN#I ݭ߽v/ .& и(ZF{e"=V!{zW`, ]+LGz"(UJp|j( #V4, 8B 0 9OkRrlɱl94)'VH9=9W|>PS['G(*I1==C<5"Pg+x'K5EMd؞Af8lG ?D FtoB[je?{k3zQ vZ;%Ɠ,]E>KZ+T/ EJxOZ1i #T<@ I}q9/t'zi(EMqw`mYkU6;[t4DPeckeM;H}_g pMww}k6#H㶏+b8雡Sxp)&C $@'b,fPߑt$RbJ'vznuS ~8='72_`{q纶|Q)Xk}cPz9p7O:'|G~8wx(a 0QCko|0ASD>Ip=4Q, d|F8RcU"/KM opKle M3#i0c%<7׿p&pZq[TR"BpqauIp$ 8~Ĩ!8Սx\ւdT>>Z40ks7 z2IQ}ItԀ<-%S⍤};zIb$I 5K}Q͙D8UguWE$Jh )cu4N tZl+[]M4k8֦Zeq֮M7uIqG 1==tLtR,ƜSrHYt&QP윯Lg' I,3@P'}'R˪e/%-Auv·ñ\> vDJzlӾNv5:|K/Jb6KI9)Zh*ZAi`?S {aiVDԲuy5W7pWeQJk֤#5&V<̺@/GH?^τZL|IJNvI:'P=Ϛt"¨=cud S Q.Ki0 !cJy;LJR;G{BJy޺[^8fK6)=yʊ+(k|&xQ2`L?Ȓ2@Mf 0C`6-%pKpm')c$׻K5[J*U[/#hH!6acB JA _|uMvDyk y)6OPYjœ50VT K}cǻP[ $:]4MEA.y)|B)cf-A?(e|lɉ#P9V)[9t.EiQPDѠ3ϴ;E:+Օ t ȥ~|_N2,ZJLt4! %ա]u {+=p.GhNcŞQI?Nd'yeh n7zi1DB)1S | S#ًZs2|Ɛy$F SxeX{7Vl.Src3E℃Q>b6G ўYCmtկ~=K0f(=LrAS GN'ɹ9<\!a`)֕y[uՍ[09` 9 +57ts6}b4{oqd+J5fa/,97J#6yν99mRWxJyѡyu_TJc`~W>l^q#Ts#2"nD1%fS)FU w{ܯ R{ ˎ󅃏џDsZSQS;LV;7 Od1&1n$ N /.q3~eNɪ]E#oM~}v֯FڦwyZ=<<>Xo稯lfMFV6p02|*=tV!c~]fa5Y^Q_WN|Vs 0ҘދU97OI'N2'8N֭fgg-}V%y]U4 峧p*91#9U kCac_AFңĪy뚇Y_AiuYyTTYЗ-(!JFLt›17uTozc. S;7A&&<ԋ5y;Ro+:' *eYJkWR[@F %SHWP 72k4 qLd'J "zB6{AC0ƁA6U.'F3:Ȅ(9ΜL;D]m8ڥ9}dU "v!;*13Rg^fJyShyy5auA?ɩGHRjo^]׽S)Fm\toy 4WQS@mE#%5ʈfFYDX ~D5Ϡ9tE9So_aU4?Ѽm%&c{n>.KW1Tlb}:j uGi(JgcYj0qn+>) %\!4{LaJso d||u//P_y7iRJ߬nHOy) l+@$($VFIQ9%EeKʈU. ia&FY̒mZ=)+qqoQn >L!qCiDB;Y<%} OgBxB!ØuG)WG9y(Ą{_yesuZmZZey'Wg#C~1Cev@0D $a@˲(.._GimA:uyw֬%;@!JkQVM_Ow:P.s\)ot- ˹"`B,e CRtaEUP<0'}r3[>?G8xU~Nqu;Wm8\RIkբ^5@k+5(By'L&'gBJ3ݶ!/㮻w҅ yqPWUg<e"Qy*167΃sJ\oz]T*UQ<\FԎ`HaNmڜ6DysCask8wP8y9``GJ9lF\G g's Nn͵MLN֪u$| /|7=]O)6s !ĴAKh]q_ap $HH'\1jB^s\|- W1:=6lJBqjY^LsPk""`]w)󭃈,(HC ?䔨Y$Sʣ{4Z+0NvQkhol6C.婧/u]FwiVjZka&%6\F*Ny#8O,22+|Db~d ~Çwc N:FuuCe&oZ(l;@ee-+Wn`44AMK➝2BRՈt7g*1gph9N) *"TF*R(#'88pm=}X]u[i7bEc|\~EMn}P瘊J)K.0i1M6=7'_\kaZ(Th{K*GJyytw"IO-PWJk)..axӝ47"89Cc7ĐBiZx 7m!fy|ϿF9CbȩV 9V-՛^pV̌ɄS#Bv4-@]Vxt-Z, &ֺ*diؠ2^VXbs֔Ìl.jQ]Y[47gj=幽ex)A0ip׳ W2[ᎇhuE^~q흙L} #-b۸oFJ_QP3r6jr+"nfzRJTUqoaۍ /$d8Mx'ݓ= OՃ| )$2mcM*cЙj}f };n YG w0Ia!1Q.oYfr]DyISaP}"dIӗթO67jqR ҊƐƈaɤGG|h;t]䗖oSv|iZqX)oalv;۩meEJ\!8=$4QU4Xo&VEĊ YS^E#d,yX_> ۘ-e\ "Wa6uLĜZi`aD9.% w~mB(02G[6y.773a7 /=o7D)$Z 66 $bY^\CuP. (x'"J60׿Y:Oi;F{w佩b+\Yi`TDWa~|VH)8q/=9!g߆2Y)?ND)%?Ǐ`k/sn:;O299yB=a[Ng 3˲N}vLNy;*?x?~L&=xyӴ~}q{qE*IQ^^ͧvü{Huu=R|>JyUlZV, B~/YF!Y\u_ݼF{_C)LD]m {H 0ihhadd nUkf3oٺCvE\)QJi+֥@tDJkB$1!Đr0XQ|q?d2) Ӣ_}qv-< FŊ߫%roppVBwü~JidY4:}L6M7f٬F "?71<2#?Jyy4뷢<_a7_=Q E=S1И/9{+93֮E{ǂw{))?maÆm(uLE#lïZ  ~d];+]h j?!|$F}*"4(v'8s<ŏUkm7^7no1w2ؗ}TrͿEk>p'8OB7d7R(A 9.*Mi^ͳ; eeUwS+C)uO@ =Sy]` }l8^ZzRXj[^iUɺ$tj))<sbDJfg=Pk_{xaKo1:-uyG0M ԃ\0Lvuy'ȱc2Ji AdyVgVh!{]/&}}ċJ#%d !+87<;qN޼Nفl|1N:8ya  8}k¾+-$4FiZYÔXk*I&'@iI99)HSh4+2G:tGhS^繿 Kتm0 вDk}֚+QT4;sC}rՅE,8CX-e~>G&'9xpW,%Fh,Ry56Y–hW-(v_,? ; qrBk4-V7HQ;ˇ^Gv1JVV%,ik;D_W!))+BoS4QsTM;gt+ndS-~:11Sgv!0qRVh!"Ȋ(̦Yl.]PQWgٳE'`%W1{ndΗBk|Ž7ʒR~,lnoa&:ü$ 3<a[CBݮwt"o\ePJ=Hz"_c^Z.#ˆ*x z̝grY]tdkP*:97YľXyBkD4N.C_[;F9`8& !AMO c `@BA& Ost\-\NX+Xp < !bj3C&QL+*&kAQ=04}cC!9~820G'PC9xa!w&bo_1 Sw"ܱ V )Yl3+ס2KoXOx]"`^WOy :3GO0g;%Yv㐫(R/r (s } u B &FeYZh0y> =2<Ϟc/ -u= c&׭,.0"g"7 6T!vl#sc>{u/Oh Bᾈ)۴74]x7 gMӒ"d]U)}" v4co[ ɡs 5Gg=XR14?5A}D "b{0$L .\4y{_fe:kVS\\O]c^W52LSBDM! C3Dhr̦RtArx4&agaN3Cf<Ԉp4~ B'"1@.b_/xQ} _߃҉/gٓ2Qkqp0շpZ2fԫYz< 4L.Cyυι1t@鎫Fe sYfsF}^ V}N<_`p)alٶ "(XEAVZ<)2},:Ir*#m_YӼ R%a||EƼIJ,,+f"96r/}0jE/)s)cjW#w'Sʯ5<66lj$a~3Kʛy 2:cZ:Yh))+a߭K::N,Q F'qB]={.]h85C9cr=}*rk?vwV렵ٸW Rs%}rNAkDv|uFLBkWY YkX מ|)1!$#3%y?pF<@<Rr0}: }\J [5FRxY<9"SQdE(Q*Qʻ)q1E0B_O24[U'],lOb ]~WjHޏTQ5Syu wq)xnw8~)c 쫬gٲߠ H% k5dƝk> kEj,0% b"vi2Wس_CuK)K{n|>t{P1򨾜j>'kEkƗBg*H%'_aY6Bn!TL&ɌOb{c`'d^{t\i^[uɐ[}q0lM˕G:‚4kb祔c^:?bpg… +37stH:0}en6x˟%/<]BL&* 5&fK9Mq)/iyqtA%kUe[ڛKN]Ě^,"`/ s[EQQm?|XJ߅92m]G.E΃ח U*Cn.j_)Tѧj̿30ڇ!A0=͜ar I3$C^-9#|pk!)?7.x9 @OO;WƝZBFU keZ75F6Tc6"ZȚs2y/1 ʵ:u4xa`C>6Rb/Yм)^=+~uRd`/|_8xbB0?Ft||Z\##|K 0>>zxv8۴吅q 8ĥ)"6>~\8:qM}#͚'ĉ#p\׶ l#bA?)|g g9|8jP(cr,BwV (WliVxxᡁ@0Okn;ɥh$_ckCgriv}>=wGzβ KkBɛ[˪ !J)h&k2%07δt}!d<9;I&0wV/ v 0<H}L&8ob%Hi|޶o&h1L|u֦y~󛱢8fٲUsւ)0oiFx2}X[zVYr_;N(w]_4B@OanC?gĦx>мgx>ΛToZoOMp>40>V Oy V9iq!4 LN,ˢu{jsz]|"R޻&'ƚ{53ўFu(<٪9:΋]B;)B>1::8;~)Yt|0(pw2N%&X,URBK)3\zz&}ax4;ǟ(tLNg{N|Ǽ\G#C9g$^\}p?556]/RP.90 k,U8/u776s ʪ_01چ|\N 0VV*3H鴃J7iI!wG_^ypl}r*jɤSR 5QN@ iZ#1ٰy;_\3\BQQ x:WJv츟ٯ$"@6 S#qe딇(/P( Dy~TOϻ<4:-+F`0||;Xl-"uw$Цi󼕝mKʩorz"mϺ$F:~E'ҐvD\y?Rr8_He@ e~O,T.(ފR*cY^m|cVR[8 JҡSm!ΆԨb)RHG{?MpqrmN>߶Y)\p,d#xۆWY*,l6]v0h15M˙MS8+EdI='LBJIH7_9{Caз*Lq,dt >+~ّeʏ?xԕ4bBAŚjﵫ!'\Ը$WNvKO}ӽmSşذqsOy?\[,d@'73'j%kOe`1.g2"e =YIzS2|zŐƄa\U,dP;jhhhaxǶ?КZ՚.q SE+XrbOu%\GتX(H,N^~]JyEZQKceTQ]VGYqnah;y$cQahT&QPZ*iZ8UQQM.qo/T\7X"u?Mttl2Xq(IoW{R^ ux*SYJ! 4S.Jy~ BROS[V|žKNɛP(L6V^|cR7i7nZW1Fd@ Ara{詑|(T*dN]Ko?s=@ |_EvF]׍kR)eBJc" MUUbY6`~V޴dJKß&~'d3i WWWWWW
Current Directory: /usr/lib/python3.6/site-packages/pip/_vendor/html5lib
Viewing File: /usr/lib/python3.6/site-packages/pip/_vendor/html5lib/_inputstream.py
from __future__ import absolute_import, division, unicode_literals from pip._vendor.six import text_type, binary_type from pip._vendor.six.moves import http_client, urllib import codecs import re from pip._vendor import webencodings from .constants import EOF, spaceCharacters, asciiLetters, asciiUppercase from .constants import ReparseException from . import _utils from io import StringIO try: from io import BytesIO except ImportError: BytesIO = StringIO # Non-unicode versions of constants for use in the pre-parser spaceCharactersBytes = frozenset([item.encode("ascii") for item in spaceCharacters]) asciiLettersBytes = frozenset([item.encode("ascii") for item in asciiLetters]) asciiUppercaseBytes = frozenset([item.encode("ascii") for item in asciiUppercase]) spacesAngleBrackets = spaceCharactersBytes | frozenset([b">", b"<"]) invalid_unicode_no_surrogate = "[\u0001-\u0008\u000B\u000E-\u001F\u007F-\u009F\uFDD0-\uFDEF\uFFFE\uFFFF\U0001FFFE\U0001FFFF\U0002FFFE\U0002FFFF\U0003FFFE\U0003FFFF\U0004FFFE\U0004FFFF\U0005FFFE\U0005FFFF\U0006FFFE\U0006FFFF\U0007FFFE\U0007FFFF\U0008FFFE\U0008FFFF\U0009FFFE\U0009FFFF\U000AFFFE\U000AFFFF\U000BFFFE\U000BFFFF\U000CFFFE\U000CFFFF\U000DFFFE\U000DFFFF\U000EFFFE\U000EFFFF\U000FFFFE\U000FFFFF\U0010FFFE\U0010FFFF]" # noqa if _utils.supports_lone_surrogates: # Use one extra step of indirection and create surrogates with # eval. Not using this indirection would introduce an illegal # unicode literal on platforms not supporting such lone # surrogates. assert invalid_unicode_no_surrogate[-1] == "]" and invalid_unicode_no_surrogate.count("]") == 1 invalid_unicode_re = re.compile(invalid_unicode_no_surrogate[:-1] + eval('"\\uD800-\\uDFFF"') + # pylint:disable=eval-used "]") else: invalid_unicode_re = re.compile(invalid_unicode_no_surrogate) non_bmp_invalid_codepoints = set([0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF]) ascii_punctuation_re = re.compile("[\u0009-\u000D\u0020-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E]") # Cache for charsUntil() charsUntilRegEx = {} class BufferedStream(object): """Buffering for streams that do not have buffering of their own The buffer is implemented as a list of chunks on the assumption that joining many strings will be slow since it is O(n**2) """ def __init__(self, stream): self.stream = stream self.buffer = [] self.position = [-1, 0] # chunk number, offset def tell(self): pos = 0 for chunk in self.buffer[:self.position[0]]: pos += len(chunk) pos += self.position[1] return pos def seek(self, pos): assert pos <= self._bufferedBytes() offset = pos i = 0 while len(self.buffer[i]) < offset: offset -= len(self.buffer[i]) i += 1 self.position = [i, offset] def read(self, bytes): if not self.buffer: return self._readStream(bytes) elif (self.position[0] == len(self.buffer) and self.position[1] == len(self.buffer[-1])): return self._readStream(bytes) else: return self._readFromBuffer(bytes) def _bufferedBytes(self): return sum([len(item) for item in self.buffer]) def _readStream(self, bytes): data = self.stream.read(bytes) self.buffer.append(data) self.position[0] += 1 self.position[1] = len(data) return data def _readFromBuffer(self, bytes): remainingBytes = bytes rv = [] bufferIndex = self.position[0] bufferOffset = self.position[1] while bufferIndex < len(self.buffer) and remainingBytes != 0: assert remainingBytes > 0 bufferedData = self.buffer[bufferIndex] if remainingBytes <= len(bufferedData) - bufferOffset: bytesToRead = remainingBytes self.position = [bufferIndex, bufferOffset + bytesToRead] else: bytesToRead = len(bufferedData) - bufferOffset self.position = [bufferIndex, len(bufferedData)] bufferIndex += 1 rv.append(bufferedData[bufferOffset:bufferOffset + bytesToRead]) remainingBytes -= bytesToRead bufferOffset = 0 if remainingBytes: rv.append(self._readStream(remainingBytes)) return b"".join(rv) def HTMLInputStream(source, **kwargs): # Work around Python bug #20007: read(0) closes the connection. # http://bugs.python.org/issue20007 if (isinstance(source, http_client.HTTPResponse) or # Also check for addinfourl wrapping HTTPResponse (isinstance(source, urllib.response.addbase) and isinstance(source.fp, http_client.HTTPResponse))): isUnicode = False elif hasattr(source, "read"): isUnicode = isinstance(source.read(0), text_type) else: isUnicode = isinstance(source, text_type) if isUnicode: encodings = [x for x in kwargs if x.endswith("_encoding")] if encodings: raise TypeError("Cannot set an encoding with a unicode input, set %r" % encodings) return HTMLUnicodeInputStream(source, **kwargs) else: return HTMLBinaryInputStream(source, **kwargs) class HTMLUnicodeInputStream(object): """Provides a unicode stream of characters to the HTMLTokenizer. This class takes care of character encoding and removing or replacing incorrect byte-sequences and also provides column and line tracking. """ _defaultChunkSize = 10240 def __init__(self, source): """Initialises the HTMLInputStream. HTMLInputStream(source, [encoding]) -> Normalized stream from source for use by html5lib. source can be either a file-object, local filename or a string. The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element) """ if not _utils.supports_lone_surrogates: # Such platforms will have already checked for such # surrogate errors, so no need to do this checking. self.reportCharacterErrors = None elif len("\U0010FFFF") == 1: self.reportCharacterErrors = self.characterErrorsUCS4 else: self.reportCharacterErrors = self.characterErrorsUCS2 # List of where new lines occur self.newLines = [0] self.charEncoding = (lookupEncoding("utf-8"), "certain") self.dataStream = self.openStream(source) self.reset() def reset(self): self.chunk = "" self.chunkSize = 0 self.chunkOffset = 0 self.errors = [] # number of (complete) lines in previous chunks self.prevNumLines = 0 # number of columns in the last line of the previous chunk self.prevNumCols = 0 # Deal with CR LF and surrogates split over chunk boundaries self._bufferedCharacter = None def openStream(self, source): """Produces a file object from source. source can be either a file object, local filename or a string. """ # Already a file object if hasattr(source, 'read'): stream = source else: stream = StringIO(source) return stream def _position(self, offset): chunk = self.chunk nLines = chunk.count('\n', 0, offset) positionLine = self.prevNumLines + nLines lastLinePos = chunk.rfind('\n', 0, offset) if lastLinePos == -1: positionColumn = self.prevNumCols + offset else: positionColumn = offset - (lastLinePos + 1) return (positionLine, positionColumn) def position(self): """Returns (line, col) of the current position in the stream.""" line, col = self._position(self.chunkOffset) return (line + 1, col) def char(self): """ Read one character from the stream or queue if available. Return EOF when EOF is reached. """ # Read a new chunk from the input stream if necessary if self.chunkOffset >= self.chunkSize: if not self.readChunk(): return EOF chunkOffset = self.chunkOffset char = self.chunk[chunkOffset] self.chunkOffset = chunkOffset + 1 return char def readChunk(self, chunkSize=None): if chunkSize is None: chunkSize = self._defaultChunkSize self.prevNumLines, self.prevNumCols = self._position(self.chunkSize) self.chunk = "" self.chunkSize = 0 self.chunkOffset = 0 data = self.dataStream.read(chunkSize) # Deal with CR LF and surrogates broken across chunks if self._bufferedCharacter: data = self._bufferedCharacter + data self._bufferedCharacter = None elif not data: # We have no more data, bye-bye stream return False if len(data) > 1: lastv = ord(data[-1]) if lastv == 0x0D or 0xD800 <= lastv <= 0xDBFF: self._bufferedCharacter = data[-1] data = data[:-1] if self.reportCharacterErrors: self.reportCharacterErrors(data) # Replace invalid characters data = data.replace("\r\n", "\n") data = data.replace("\r", "\n") self.chunk = data self.chunkSize = len(data) return True def characterErrorsUCS4(self, data): for _ in range(len(invalid_unicode_re.findall(data))): self.errors.append("invalid-codepoint") def characterErrorsUCS2(self, data): # Someone picked the wrong compile option # You lose skip = False for match in invalid_unicode_re.finditer(data): if skip: continue codepoint = ord(match.group()) pos = match.start() # Pretty sure there should be endianness issues here if _utils.isSurrogatePair(data[pos:pos + 2]): # We have a surrogate pair! char_val = _utils.surrogatePairToCodepoint(data[pos:pos + 2]) if char_val in non_bmp_invalid_codepoints: self.errors.append("invalid-codepoint") skip = True elif (codepoint >= 0xD800 and codepoint <= 0xDFFF and pos == len(data) - 1): self.errors.append("invalid-codepoint") else: skip = False self.errors.append("invalid-codepoint") def charsUntil(self, characters, opposite=False): """ Returns a string of characters from the stream up to but not including any character in 'characters' or EOF. 'characters' must be a container that supports the 'in' method and iteration over its characters. """ # Use a cache of regexps to find the required characters try: chars = charsUntilRegEx[(characters, opposite)] except KeyError: if __debug__: for c in characters: assert(ord(c) < 128) regex = "".join(["\\x%02x" % ord(c) for c in characters]) if not opposite: regex = "^%s" % regex chars = charsUntilRegEx[(characters, opposite)] = re.compile("[%s]+" % regex) rv = [] while True: # Find the longest matching prefix m = chars.match(self.chunk, self.chunkOffset) if m is None: # If nothing matched, and it wasn't because we ran out of chunk, # then stop if self.chunkOffset != self.chunkSize: break else: end = m.end() # If not the whole chunk matched, return everything # up to the part that didn't match if end != self.chunkSize: rv.append(self.chunk[self.chunkOffset:end]) self.chunkOffset = end break # If the whole remainder of the chunk matched, # use it all and read the next chunk rv.append(self.chunk[self.chunkOffset:]) if not self.readChunk(): # Reached EOF break r = "".join(rv) return r def unget(self, char): # Only one character is allowed to be ungotten at once - it must # be consumed again before any further call to unget if char is not None: if self.chunkOffset == 0: # unget is called quite rarely, so it's a good idea to do # more work here if it saves a bit of work in the frequently # called char and charsUntil. # So, just prepend the ungotten character onto the current # chunk: self.chunk = char + self.chunk self.chunkSize += 1 else: self.chunkOffset -= 1 assert self.chunk[self.chunkOffset] == char class HTMLBinaryInputStream(HTMLUnicodeInputStream): """Provides a unicode stream of characters to the HTMLTokenizer. This class takes care of character encoding and removing or replacing incorrect byte-sequences and also provides column and line tracking. """ def __init__(self, source, override_encoding=None, transport_encoding=None, same_origin_parent_encoding=None, likely_encoding=None, default_encoding="windows-1252", useChardet=True): """Initialises the HTMLInputStream. HTMLInputStream(source, [encoding]) -> Normalized stream from source for use by html5lib. source can be either a file-object, local filename or a string. The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element) """ # Raw Stream - for unicode objects this will encode to utf-8 and set # self.charEncoding as appropriate self.rawStream = self.openStream(source) HTMLUnicodeInputStream.__init__(self, self.rawStream) # Encoding Information # Number of bytes to use when looking for a meta element with # encoding information self.numBytesMeta = 1024 # Number of bytes to use when using detecting encoding using chardet self.numBytesChardet = 100 # Things from args self.override_encoding = override_encoding self.transport_encoding = transport_encoding self.same_origin_parent_encoding = same_origin_parent_encoding self.likely_encoding = likely_encoding self.default_encoding = default_encoding # Determine encoding self.charEncoding = self.determineEncoding(useChardet) assert self.charEncoding[0] is not None # Call superclass self.reset() def reset(self): self.dataStream = self.charEncoding[0].codec_info.streamreader(self.rawStream, 'replace') HTMLUnicodeInputStream.reset(self) def openStream(self, source): """Produces a file object from source. source can be either a file object, local filename or a string. """ # Already a file object if hasattr(source, 'read'): stream = source else: stream = BytesIO(source) try: stream.seek(stream.tell()) except: # pylint:disable=bare-except stream = BufferedStream(stream) return stream def determineEncoding(self, chardet=True): # BOMs take precedence over everything # This will also read past the BOM if present charEncoding = self.detectBOM(), "certain" if charEncoding[0] is not None: return charEncoding # If we've been overriden, we've been overriden charEncoding = lookupEncoding(self.override_encoding), "certain" if charEncoding[0] is not None: return charEncoding # Now check the transport layer charEncoding = lookupEncoding(self.transport_encoding), "certain" if charEncoding[0] is not None: return charEncoding # Look for meta elements with encoding information charEncoding = self.detectEncodingMeta(), "tentative" if charEncoding[0] is not None: return charEncoding # Parent document encoding charEncoding = lookupEncoding(self.same_origin_parent_encoding), "tentative" if charEncoding[0] is not None and not charEncoding[0].name.startswith("utf-16"): return charEncoding # "likely" encoding charEncoding = lookupEncoding(self.likely_encoding), "tentative" if charEncoding[0] is not None: return charEncoding # Guess with chardet, if available if chardet: try: from chardet.universaldetector import UniversalDetector except ImportError: pass else: buffers = [] detector = UniversalDetector() while not detector.done: buffer = self.rawStream.read(self.numBytesChardet) assert isinstance(buffer, bytes) if not buffer: break buffers.append(buffer) detector.feed(buffer) detector.close() encoding = lookupEncoding(detector.result['encoding']) self.rawStream.seek(0) if encoding is not None: return encoding, "tentative" # Try the default encoding charEncoding = lookupEncoding(self.default_encoding), "tentative" if charEncoding[0] is not None: return charEncoding # Fallback to html5lib's default if even that hasn't worked return lookupEncoding("windows-1252"), "tentative" def changeEncoding(self, newEncoding): assert self.charEncoding[1] != "certain" newEncoding = lookupEncoding(newEncoding) if newEncoding is None: return if newEncoding.name in ("utf-16be", "utf-16le"): newEncoding = lookupEncoding("utf-8") assert newEncoding is not None elif newEncoding == self.charEncoding[0]: self.charEncoding = (self.charEncoding[0], "certain") else: self.rawStream.seek(0) self.charEncoding = (newEncoding, "certain") self.reset() raise ReparseException("Encoding changed from %s to %s" % (self.charEncoding[0], newEncoding)) def detectBOM(self): """Attempts to detect at BOM at the start of the stream. If an encoding can be determined from the BOM return the name of the encoding otherwise return None""" bomDict = { codecs.BOM_UTF8: 'utf-8', codecs.BOM_UTF16_LE: 'utf-16le', codecs.BOM_UTF16_BE: 'utf-16be', codecs.BOM_UTF32_LE: 'utf-32le', codecs.BOM_UTF32_BE: 'utf-32be' } # Go to beginning of file and read in 4 bytes string = self.rawStream.read(4) assert isinstance(string, bytes) # Try detecting the BOM using bytes from the string encoding = bomDict.get(string[:3]) # UTF-8 seek = 3 if not encoding: # Need to detect UTF-32 before UTF-16 encoding = bomDict.get(string) # UTF-32 seek = 4 if not encoding: encoding = bomDict.get(string[:2]) # UTF-16 seek = 2 # Set the read position past the BOM if one was found, otherwise # set it to the start of the stream if encoding: self.rawStream.seek(seek) return lookupEncoding(encoding) else: self.rawStream.seek(0) return None def detectEncodingMeta(self): """Report the encoding declared by the meta element """ buffer = self.rawStream.read(self.numBytesMeta) assert isinstance(buffer, bytes) parser = EncodingParser(buffer) self.rawStream.seek(0) encoding = parser.getEncoding() if encoding is not None and encoding.name in ("utf-16be", "utf-16le"): encoding = lookupEncoding("utf-8") return encoding class EncodingBytes(bytes): """String-like object with an associated position and various extra methods If the position is ever greater than the string length then an exception is raised""" def __new__(self, value): assert isinstance(value, bytes) return bytes.__new__(self, value.lower()) def __init__(self, value): # pylint:disable=unused-argument self._position = -1 def __iter__(self): return self def __next__(self): p = self._position = self._position + 1 if p >= len(self): raise StopIteration elif p < 0: raise TypeError return self[p:p + 1] def next(self): # Py2 compat return self.__next__() def previous(self): p = self._position if p >= len(self): raise StopIteration elif p < 0: raise TypeError self._position = p = p - 1 return self[p:p + 1] def setPosition(self, position): if self._position >= len(self): raise StopIteration self._position = position def getPosition(self): if self._position >= len(self): raise StopIteration if self._position >= 0: return self._position else: return None position = property(getPosition, setPosition) def getCurrentByte(self): return self[self.position:self.position + 1] currentByte = property(getCurrentByte) def skip(self, chars=spaceCharactersBytes): """Skip past a list of characters""" p = self.position # use property for the error-checking while p < len(self): c = self[p:p + 1] if c not in chars: self._position = p return c p += 1 self._position = p return None def skipUntil(self, chars): p = self.position while p < len(self): c = self[p:p + 1] if c in chars: self._position = p return c p += 1 self._position = p return None def matchBytes(self, bytes): """Look for a sequence of bytes at the start of a string. If the bytes are found return True and advance the position to the byte after the match. Otherwise return False and leave the position alone""" p = self.position data = self[p:p + len(bytes)] rv = data.startswith(bytes) if rv: self.position += len(bytes) return rv def jumpTo(self, bytes): """Look for the next sequence of bytes matching a given sequence. If a match is found advance the position to the last byte of the match""" newPosition = self[self.position:].find(bytes) if newPosition > -1: # XXX: This is ugly, but I can't see a nicer way to fix this. if self._position == -1: self._position = 0 self._position += (newPosition + len(bytes) - 1) return True else: raise StopIteration class EncodingParser(object): """Mini parser for detecting character encoding from meta elements""" def __init__(self, data): """string - the data to work on for encoding detection""" self.data = EncodingBytes(data) self.encoding = None def getEncoding(self): methodDispatch = ( (b"<!--", self.handleComment), (b"<meta", self.handleMeta), (b"</", self.handlePossibleEndTag), (b"<!", self.handleOther), (b"<?", self.handleOther), (b"<", self.handlePossibleStartTag)) for _ in self.data: keepParsing = True for key, method in methodDispatch: if self.data.matchBytes(key): try: keepParsing = method() break except StopIteration: keepParsing = False break if not keepParsing: break return self.encoding def handleComment(self): """Skip over comments""" return self.data.jumpTo(b"-->") def handleMeta(self): if self.data.currentByte not in spaceCharactersBytes: # if we have <meta not followed by a space so just keep going return True # We have a valid meta element we want to search for attributes hasPragma = False pendingEncoding = None while True: # Try to find the next attribute after the current position attr = self.getAttribute() if attr is None: return True else: if attr[0] == b"http-equiv": hasPragma = attr[1] == b"content-type" if hasPragma and pendingEncoding is not None: self.encoding = pendingEncoding return False elif attr[0] == b"charset": tentativeEncoding = attr[1] codec = lookupEncoding(tentativeEncoding) if codec is not None: self.encoding = codec return False elif attr[0] == b"content": contentParser = ContentAttrParser(EncodingBytes(attr[1])) tentativeEncoding = contentParser.parse() if tentativeEncoding is not None: codec = lookupEncoding(tentativeEncoding) if codec is not None: if hasPragma: self.encoding = codec return False else: pendingEncoding = codec def handlePossibleStartTag(self): return self.handlePossibleTag(False) def handlePossibleEndTag(self): next(self.data) return self.handlePossibleTag(True) def handlePossibleTag(self, endTag): data = self.data if data.currentByte not in asciiLettersBytes: # If the next byte is not an ascii letter either ignore this # fragment (possible start tag case) or treat it according to # handleOther if endTag: data.previous() self.handleOther() return True c = data.skipUntil(spacesAngleBrackets) if c == b"<": # return to the first step in the overall "two step" algorithm # reprocessing the < byte data.previous() else: # Read all attributes attr = self.getAttribute() while attr is not None: attr = self.getAttribute() return True def handleOther(self): return self.data.jumpTo(b">") def getAttribute(self): """Return a name,value pair for the next attribute in the stream, if one is found, or None""" data = self.data # Step 1 (skip chars) c = data.skip(spaceCharactersBytes | frozenset([b"/"])) assert c is None or len(c) == 1 # Step 2 if c in (b">", None): return None # Step 3 attrName = [] attrValue = [] # Step 4 attribute name while True: if c == b"=" and attrName: break elif c in spaceCharactersBytes: # Step 6! c = data.skip() break elif c in (b"/", b">"): return b"".join(attrName), b"" elif c in asciiUppercaseBytes: attrName.append(c.lower()) elif c is None: return None else: attrName.append(c) # Step 5 c = next(data) # Step 7 if c != b"=": data.previous() return b"".join(attrName), b"" # Step 8 next(data) # Step 9 c = data.skip() # Step 10 if c in (b"'", b'"'): # 10.1 quoteChar = c while True: # 10.2 c = next(data) # 10.3 if c == quoteChar: next(data) return b"".join(attrName), b"".join(attrValue) # 10.4 elif c in asciiUppercaseBytes: attrValue.append(c.lower()) # 10.5 else: attrValue.append(c) elif c == b">": return b"".join(attrName), b"" elif c in asciiUppercaseBytes: attrValue.append(c.lower()) elif c is None: return None else: attrValue.append(c) # Step 11 while True: c = next(data) if c in spacesAngleBrackets: return b"".join(attrName), b"".join(attrValue) elif c in asciiUppercaseBytes: attrValue.append(c.lower()) elif c is None: return None else: attrValue.append(c) class ContentAttrParser(object): def __init__(self, data): assert isinstance(data, bytes) self.data = data def parse(self): try: # Check if the attr name is charset # otherwise return self.data.jumpTo(b"charset") self.data.position += 1 self.data.skip() if not self.data.currentByte == b"=": # If there is no = sign keep looking for attrs return None self.data.position += 1 self.data.skip() # Look for an encoding between matching quote marks if self.data.currentByte in (b'"', b"'"): quoteMark = self.data.currentByte self.data.position += 1 oldPosition = self.data.position if self.data.jumpTo(quoteMark): return self.data[oldPosition:self.data.position] else: return None else: # Unquoted value oldPosition = self.data.position try: self.data.skipUntil(spaceCharactersBytes) return self.data[oldPosition:self.data.position] except StopIteration: # Return the whole remaining value return self.data[oldPosition:] except StopIteration: return None def lookupEncoding(encoding): """Return the python codec name corresponding to an encoding or None if the string doesn't correspond to a valid encoding.""" if isinstance(encoding, binary_type): try: encoding = encoding.decode("ascii") except UnicodeDecodeError: return None if encoding is not None: try: return webencodings.lookup(encoding) except AttributeError: return None else: return None