Glynx - a download manager.
Download from http://www.ipct.pucrs.br/flavio/glynx/glynx-latest.pl
Glynx makes a local image of a selected part of the internet.
It can also generate download lists to be used by other download managers, allowing for a distributed download process.
It currently supports resume, retry, referer, user-agent, java, frames,
distributed download (see --slave, --stop, --restart).
It partially supports redirect, javascript, multimedia, authentication, and mirror.
It does not support forms.
It has not been tested with ``https'' yet.
It should be better tested with ``ftp''.
Tested on Linux and NT.
$progname.pl [options] <URL>
$progname.pl [options] --dump="dump-file" <URL>
$progname.pl [options] "download-list-file"
$progname.pl [options] --slave
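An example invocation (illustrative only; the URL and option values are hypothetical):

    perl glynx.pl --depth=2 --base-dir=./download http://www.site.com/index.htm

This downloads the page and the pages it links to, up to two levels deep, into the ./download directory.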
How to make a default configuration:
Start the program with all desired command-line options, plus --cfg-save
or:
1 - start the program with --cfg-save
2 - edit the glynx.ini file
--subst, --exclude and --loop use regular expressions.
http://www.site.com/old.htm --subst=s/old/new/ downloads: http://www.site.com/new.htm
- Note: the substitution string MUST be made of "valid URL" characters
--exclude=/\.gif/ will not download ".gif" files
- Note: Multiple --exclude are allowed:
--exclude=/gif/ --exclude=/jpeg/ will not download ".gif" or ".jpeg" files
It can also be written as: --exclude=/\.gif|\.jp.?g/i matching .gif, .GIF, .jpg, .jpeg, .JPG, .JPEG
--exclude=/www\.site\.com/ will not download links containing the site name
http://www.site.com/bin/index.htm --prefix=http://www.site.com/bin/ won't download outside of the "/bin" directory. The prefix must end with a slash "/".
http://www.site.com/index%%%.htm --loop=%%%:0..3
will download:
http://www.site.com/index0.htm
http://www.site.com/index1.htm
http://www.site.com/index2.htm
http://www.site.com/index3.htm
- Note: the substitution string MUST be made of "valid URL" characters
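A minimal sketch (plain Perl, not part of Glynx) of how such a placeholder expansion can work, assuming the part after ":" is a Perl-style range:

    my $url = 'http://www.site.com/index%%%.htm';
    my ($placeholder, $range) = ('%%%', '0..3');    # as given by --loop=%%%:0..3
    for my $value (eval $range) {                   # expands to 0, 1, 2, 3
        (my $expanded = $url) =~ s/\Q$placeholder\E/$value/;
        print "$expanded\n";                        # prints the four URLs above
    }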
- For multiple exclusion: use ``|''.
- Don't read directory-index:
?D=D ?D=A ?S=D ?S=A ?M=D ?M=A ?N=D ?N=A => \?[DSMN]=[AD]
To change the default "exclude" pattern, put it in the configuration file.
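As a quick illustration (plain Perl, not part of Glynx) of how that directory-index pattern filters the sort links above:

    my $exclude = qr/\?[DSMN]=[AD]/;
    for my $url ('http://www.site.com/dir/?D=A',
                 'http://www.site.com/dir/page.htm') {
        print "$url => ", ($url =~ $exclude ? "excluded" : "kept"), "\n";
    }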
Note: the ``File:'' item in the dump file is ignored.
You can filter the processing of a dump file using --prefix, --exclude, --subst
If, after the download has finished, you still have ``.PART._BUSY_'' files in the base directory, rename them to ``.PART'' (the program should do this by itself).
Don't do this: --depth=1 --out-depth=3 because ``out-depth'' is an upper limit; it is tested after depth is generated. The right way is: --depth=4 --out-depth=3
This will do nothing:
--dump=x graphic.gif
because the dump file gets all binary files.
Errors using https:
[ ERROR 501 Protocol scheme 'https' is not supported => LATER ] or [ ERROR 501 Can't locate object method "new" via package "LWP::Protocol::https" => LATER ]
This means you need to install at least ``openssl'' (http://www.openssl.org), Net::SSLeay and IO::Socket::SSL
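A small sketch to check whether https support is in place after installing those modules (LWP::Protocol::implementor returns the protocol class, or false when the scheme is unsupported):

    use LWP::Protocol;
    print LWP::Protocol::implementor('https')
        ? "https is supported\n"
        : "https is NOT supported\n";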
Very basic:
--version           Print version number ($VERSION) and quit
--verbose           More output
--quiet             No output
--help              Help page
--cfg-save          Save configuration to file "$CFG_FILE"
--base-dir=DIR      Place to load/save files (default is "$BASE_DIR")
Download options are:
--sleep=SECS Sleep between gets, ie. go slowly (default is $SLEEP)
--prefix=PREFIX Limit URLs to those which begin with PREFIX (default is URL base)
Multiple "--prefix" are allowed.
--depth=N Maximum depth to traverse (default is $DEPTH)
--out-depth=N Maximum depth to traverse outside of PREFIX (default is $OUT_DEPTH)
--referer=URI Set initial referer header (default is "$REFERER")
--limit=N        A limit on the number of documents to get (default is $MAX_DOCS)
--retry=N        Maximum number of retries (default is $RETRY_MAX)
--timeout=SECS   Timeout value - increases on retries (default is $TIMEOUT)
--agent=AGENT User agent name (default is "$AGENT")
--mirror Checks all existing files for updates (default is --nomirror)
--mediaext Creates a file link, guessing the media type extension (.jpg, .gif)
(Windows perl makes a file copy) (default is --nomediaext)
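The link-or-copy behaviour behind --mediaext is roughly this idea (a sketch only, with hypothetical file names; as noted above, Windows perl ends up with a copy):

    use File::Copy;
    # "picture" is the downloaded file, "picture.jpg" the alias with the guessed extension
    my ($file, $alias) = ('picture', 'picture.jpg');
    link($file, $alias)
        or copy($file, $alias)
        or warn "can't link or copy $file to $alias: $!";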
Multi-process control:
--slave          Wait until a download-list file is created (be a slave)
--stop           Stop slave
--restart        Stop and restart slave
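An illustrative distributed setup (a sketch; the dump file name is hypothetical, and it assumes the slave watches the shared base directory for new download-list files):

    perl glynx.pl --slave                                         (machine or window 1)
    perl glynx.pl --dump="job-1" --depth=2 http://www.site.com/   (machine or window 2)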
Not implemented yet but won't generate fatal errors (compatibility with lwp-rget):
--auth=USER:PASS Set authentication credentials for web site
--hier Download into hierarchy (not all files into cwd)
--iis Workaround IIS 2.0 bug by sending "Accept: */*" MIME
header; translates backslashes (\) to forward slashes (/)
--keepext=type Keep file extension for MIME types (comma-separated list)
--nospace        Translate spaces in URLs (not #fragments) to underscores (_)
--tolower Translate all URLs to lowercase (useful with IIS servers)
Other options (to be better explained):
--indexfile=FILE Index file in a directory (default is "$INDEXFILE")
--part-suffix=.SUFFIX (default is "$PART_SUFFIX") (eg: ".Getright" ".PART")
--dump=FILE (default is "$DUMP") make download-list file,
to be used later
--dump-max=N (default is $DUMP_MAX) number of links per download-list file
--invalid-char=C (default is "$INVALID_CHAR")
--exclude=/REGEXP/i (default is "@EXCLUDE") Don't download matching URLs
Multiple --exclude are allowed
--loop=REGEXP:INITIAL..FINAL (default is "$LOOP") (eg: xx:a,b,c xx:'01'..'10')
--subst=s/REGEXP/VALUE/i (default is "$show_subst") (note: "\" must be written as "\\"; see the example after this list)
--404-retry will retry on error 404 Not Found (default).
--no404-retry creates an empty file on error 404 Not Found.
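An illustrative --subst invocation showing the doubled backslash (hypothetical file names):

    glynx.pl --subst=s/old\\.htm/new\\.htm/ http://www.site.com/old.htm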
More command-line compatibility with lwp-rget
Graphical user interface
Glynx - a download manager.
Installation:
Windows:
- unzip to a directory, such as c:\glynx or even c:\temp
- this is a DOS (command-line) script; it will not work properly if you double-click it.
  However, you can put it in the Startup directory in "slave mode"
  by making a shortcut (link) with the --slave parameter. Then open another DOS window
  to operate it as a client.
- the latest ActivePerl has all the modules needed, except for https.
Unix/Linux:
make a subdirectory and cd to it
tar -xzf Glynx-[version].tar.gz
chmod +x glynx.pl (if necessary)
pod2html --infile=glynx.pl --outfile=glynx.htm (this is optional)
- under RedHat 6.2 I had to upgrade or install these modules:
HTML::Tagset MIME::Base64 URI HTML::Parser Digest::MD5 libnet libwww-perl
- to use https you will need:
openssl (www.openssl.org) Net::SSLeay IO::Socket::SSL
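One possible way to bring those modules up to date from CPAN (a sketch; Bundle::LWP pulls in libwww-perl together with URI, HTML::Parser, HTML::Tagset and MIME::Base64):

    perl -MCPAN -e 'install "Bundle::LWP"'
    perl -MCPAN -e 'install "Digest::MD5"'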
Please note that the software will create many files in
its work directory, so it is advisable to have a dedicated
sub-directory for it.
Goals:
generalize
option to use (external) java and other script languages to extract links
configurable file names and suffixes
configurable dump file format
plugins
more protocols; download streams
language support
adhere to perl standards
pod documentation
distribution
difficult to understand, fun to write
parallelize things and multiple computer support
cpu and memory optimizations
accept hardware/internet failures
restartable
reduce internet traffic
minimize requests
cache everything
other (from perlhack.pod)
1. Keep it fast, simple, and useful.
2. Keep features/concepts as orthogonal as possible.
3. No arbitrary limits (platforms, data sizes, cultures).
4. Keep it open and exciting to use/patch/advocate Perl everywhere.
5. Either assimilate new technologies, or build bridges to them.
Problems (not bugs):
- It takes some time to start the program; not practical for small single file downloads.
- Command line only. It should have a graphical front-end; there exists a web front-end.
- Hard to install if you don't have Perl or have outdated Perl modules. It works fine
with Perl 5.6 modules.
- slave mode uses "dump files", and doesn't delete them.
To-do (long list):
Bugs/debug/testing:
- test: timeout changes after "slave"
- test: counting MAX_DOCS with retry
- test: base-dir, out-depth, site leakage
- test: authentication
- test: redirect 3xx
use: www.ig.com.br ?
- perl "link" is copying instead of linking, even on linux
- 401 - auth required -- supply name:pass
- implement "If-Range:"
- put // around exclude patterns, etc., if they don't have them
- arrays for $LOOP,$SUBST; accept multiple URL
- Doesn't recreate unix links on "ftp".
Should do that instead of duplicating files (same on http redirects).
- uses Accept:text/html to ask for an html listing of the directory when
in "ftp" mode. This will have to be changed to "text/ftp-dir-listing" if
we implement unix links.
- install and test "https"
- accept --url=http://...
- accept --batch=...grx
- ignore/accept comments: <! a href="..."> - nested comments???
- http server to make distributed downloads across the internet
- use eval to avoid fatal errors; test for valid protocols
- rename "old" .grx._BUSY_ files to .grx (timeout = 1 day?)
option: touch busy file to show activity
- don't ignore "File:"
- unknown protocol is a fatal error
- change the retry loop to a "while"
- configuration reading order:
  (1) read the command-line options (they may change the .ini file),
  (2) read the .ini configuration,
  (3) read the command-line options again (they may override the .ini),
  (4) read the download-list-file,
  (5) read the command-line options again (they may override the download-list-file)
- execute/override the download-list-file "File:" directive
  option: use --subst=/k:\\temp/c:\\download/
Generalization, user-interface:
- optional log file to store the headers.
  Option: filename._HEADER_; --log-headers
- make it a Perl module (crawler, robot?), generic, re-usable
- option to understand robot-rules
- make .glynx the default suffix for everything
- try to support <form> through download-list-file
- internal small javascript interpreter
- perl/tk front-end; finish web front end
- config comment-string in download-list-file
- config comment/uncomment for directives
- default dump file name when no parameter is given - "dump-[n]-1"?
- more configuration parameters
- Portuguese/English language option?
- plugins: for each chunk, page, link, new site, level change, dump file change,
  max files, on errors, retry level change. Option: use callbacks.
- dump suffix option
- javascript interpreter option
- scripting option (execute sequentially instead of parallel)
- use environment
- accept configuration --nofollow="shtml" and --follow="xxx"
- time-of-day control, bytes per second
- pnm: protocol - RealVideo, .rpm files
- streams
- gnutella
- 401 Authentication Required, generalize abort-on-error list,
support --auth= (see rget)
- option to rewrite html pages with relative links
Standards/perl:
- packaging for distribution, include rfcs, etc?
- include a default ini file in package
- include web front-end in package?
- installation hints, package version problems (abs_url)
- more english writing
- include all lwp-rget options, or ignore without exiting
- create an object for the link lists - choose and specialize an existing one.
- check: 19.4.5 HTTP Header Fields in Multipart Body-Parts
Content-Encoding
Persistent connections: Connection-header
Accept: */*, *.*
- better document the use of "\" in exclude and subst
- read, send, and configure cookies
Network/parallel support:
- timed downloads - start/stop hours
- write a "to-do" file during processing,
  so that it is possible to resume after an interruption.
  e.g. every 10 minutes
- Redo Web front-end
Speed optimizations:
- use an optional database connection
- Persistent connections;
- take a look at LWP::ParallelUserAgent
- take a look at LWPng for simultaneous file transfers
- take a look at LWP::Sitemapper
- use eval around things to speed up program loading
- option: different stacks depending on the file type or site, to speed up the search
Other:
- forms / PUT
- Rename the extension according to the mime-type (or copy to the other name).
  configuration: --on-redirect=rename
                 --on-redirect=copy
                 --on-mime=rename
                 --on-mime=copy
- configurable maximum URL length
- configurable maximum subdirectory depth
- maximum size of a received file
- disk full / alternate dir
- "--proxy=http:"1.1.1.1",ftp:"1.1.1.1"
"--proxy="1.1.1.1"
acessar proxy: $ua->proxy(...) Set/retrieve proxy URL for a scheme:
$ua->proxy(['http', 'ftp'], 'http://proxy.sn.no:8001/');
$ua->proxy('gopher', 'http://proxy.sn.no:8001/');
- enable "--no-[option]"
- accept empty "--dump" or "--no-dump" / "--nodump"
--max-mb=100
  limits the total download size
--auth=USER:PASS
  not really necessary, it can be part of the URL
  exists in lwp-rget
--nospace
  allows links with spaces in the name (see lwp-rget)
--relative-links
  option to rewrite the links as relative
--include=".exe" --nofollow=".shtml" --follow=".htm"
  file-inclusion options (whether to look for links inside)
--full or --depth=full
  whole-site option
--chunk=128000
--dump-all
  writes all links, including those that already exist and pages already processed
Version history:
1.023:
- better redirect, but perl "link" is copying instead of linking
- --mirror option (304)
- --mediaext option
- sets file dates to last-modified
1.022:
- multiple --prefix and --exclude seems to be working
- uses Accept:text/html to ask for an html listing of the directory when in "ftp" mode.
- corrected errors creating directory and copying file on linux
1.021:
- uses URI::Heuristic on command-line URL
- shows error response headers (if verbose)
- look at the 3rd parameter on 206 (when available -- otherwise it gives 500),
Content-Length: 637055 --> if "206" this is "chunk" size
Content-Range: bytes 1449076-2086130/2086131 --> THIS is file size
- prefix of: http://rd.yahoo.com/footer/?http://travel.yahoo.com/
should be: http://rd.yahoo.com/footer/
- included: "wav"
- sleep had 1 extra second
- sleep makes tests even when sleep==0
1.020: oct-02-2000
- optimization: accepts 200, when expecting 206
- don't keep retrying when there is nothing to do
- 404 Not Found error sometimes means "can't connect" - uses "--404-retry"
- file read = binmode
1.019: - restart if program was modified (-M $0)
- include "mov"
- stop, restart
1.018: - better copy, rename and unlink
- corrected binary dump when slave
- corrected the file size comparison
- span is a css command that works like "a" (a href == span href);
  span class is not java
1.017: - sleep prints dots if verbose.
- daemon mode (--slave)
- url and input file are optional
1.016: sept-27-2000
- new name "glynx.pl"
- verbose/quiet
- exponential timeout on retry
- storage control is a bit more efficient
- you can filter the processing of a dump file using prefix, exclude, subst
- more things in english, lots of new "to-do"; "goals" section
- rename config file to glynx.ini
1.015: - first published version, under name "get.pl"
- single push/shift routine, without repetition
- partially translated to English, messages revised
1.014: - checks "inside" before including the link
- corrects the numbering of the dump files
- header "Location", "Content-Base"
- revised "Content-Location"
1.013: - optimization: remove repetitions within the page
- included "png"
- creates/tests the "not-found" file
- processes Content-Location - TO TEST - find a site that uses it
- included types "swf", "dcr" (shockwave) and "css" (style sheet)
- fixes http://host/../file being saved as ./host/../file => ./file
- strips stray characters coming from javascript: ' ;
- pending retries are only written out at the end.
- (1) read options, (2) read configuration, (3) read options again
1.012: - splits the dump file during processing, allowing the
download to be started in parallel from another process/computer before the task is
completely finished
- uses an index to write the dump; does not destroy the in-memory list.
- saves the complete configuration along with the dump;
- saves/reads get.ini
1.011: fixes authentication (prefix)
fixes the dump
reads the dump
saves/reads $OUT_DEPTH, depth (individual), prefix in the dump file
1.010: resume
if the site does not support resume, tries again and picks the best result (Silvio's idea)
1.009: 404 not found is not sent to the dump
processes the file if the mime type is text/html (does not work for the cache)
changes the referer of the links depending on the response base (redirect)
treats zero-length files as "not in the cache"
generates the name _INDEX_.HTM when the URL ends with "/".
1.008: works internally with absolute URLs
fixes leakage when out-depth=0
1.007: splits the dump file
speeds up the search in @processed
fixes the directory name in the dump file
Other problems - design decisions to make:
- if '' is used in eval, is "\\" not needed?
- should redirected html pages receive a <BASE> tag in the text?
- build links using java?
- does the perl library handle 3xx redirection by itself?
- use File::Path to create directories?
- do applets always end in .class?
- excessively long file names - what to do?
- use $ua->max_size([$bytes]) - does not work with a callback
- change the filename if the response base is different?
- create a zero-length PART file on error 408 - timeout
- what is the go!zilla dump format?
Copyright (c) 2000 Flavio Glock <fglock@pucrs.br> All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. This program was based on examples in the Perl distribution.
If you use it/like it, send a postcard to the author.